Creating an AI Blog Post Generator with LangChain: A Step-by-Step Guide
There is an AI content creation tool called copy.ai that has a concept of workflows. A workflow is a series of pre-determined steps carried out with the aim of achieving an end goal using AI. The workflow that impressed me the most was the SEO (Search Engine Optimization) Optimized Blog Generator, which searches the web, picks the top 3 search results, analyses their structure, generates questions, a blog title, and a blog outline, and finally writes the blog itself. I thought this would be a great opportunity to use LangChain to orchestrate the same workflow with the tools it makes available to us.
In this blog I will walk through the steps I took to create the AI blog writing agent, describe the architecture of the blog agent, and explain the different decisions I had to make along the way.
Architecture
The agent consists of the following main components:
1. Document loading: The first step is to perform an internet search and then load the retrieved web pages into documents using a DocumentLoader.
2. Splitting the documents: The next step is to split the documents into smaller chunks. Splitting lets us search faster and fit as much relevant data as possible into the context window of the LLM that we are going to use.
3. Storage: The split documents need to be stored in a vector database/vector store. Before we save the document chunks, we first embed them using OpenAIEmbeddings, and then store the embeddings in our vector store. The embeddings are easily searchable because we can cheaply compute similarity scores between them, which enables semantic search across the document chunks.
4. Retrieval: When a user enters keywords to generate a blog post, we want to retrieve from the vector store the chunks that are relevant to those keywords. This is handled by the retriever, which extracts information from the vector store given the keyword.
5. Blog generation: To generate the blog post, we pass the retrieved document chunks and the user's keyword(s) in a prompt to the LLM, which writes the final blog post.
Requirements
We will be using Python and LangChain to build this AI workflow, so the first step is to install the necessary packages using pip, the Python package manager.
It is good practice to first create a virtual environment in your project directory and activate it.
On Windows, you can run the following command to create a virtual environment (the macOS/Linux equivalents are shown just below):
python -m venv venv
To activate the virtual environment, run the following command:
venv\Scripts\activate
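On macOS or Linux, the equivalent commands to create and activate the environment are:
python3 -m venv venv
source venv/bin/activate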
And then you want to install the necessary packages for the project (including langchain_chroma for the vector store and duckduckgo_search for the search tool, both of which the code below depends on). You can do this with the following command:
pip install langchain_community langchain langchain_core langchain_openai langchain_text_splitters langchain_chroma duckduckgo_search bs4
Implementation
Add the code for each of the following steps to your main.py file.
1. Internet Search
The first step in our process is to perform an internet search. We will use the DuckDuckGo Search API for this. Using the user's keyword(s), we perform a search, take the top 3 results, extract their links, and collect the links in a list that the get_links function returns.
from langchain_community.tools import DuckDuckGoSearchResults
from langchain_community.utilities.duckduckgo_search import DuckDuckGoSearchAPIWrapper
import os
import re

keyword = "<input_your_own_keyword/keywords>"
os.environ['OPENAI_API_KEY'] = "<your_openai_api_key>"

def get_links(keyword):
    # Search DuckDuckGo and keep only the top 3 results
    wrapper = DuckDuckGoSearchAPIWrapper(max_results=3)
    search = DuckDuckGoSearchResults(api_wrapper=wrapper)
    results = search.run(tool_input=keyword)
    # The tool returns one string; pull the result URLs out of it
    links = re.findall(r'link:\s*(https?://[^\],\s]+)', results)
    return links
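As a quick sanity check, you can call get_links with a sample keyword and print what comes back; the keyword below is just a hypothetical example:
# Hypothetical example keyword; any search phrase works
print(get_links("how to learn python"))  # expect a list of up to 3 URLs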
2. Document loading
In this step, we will use a BeautifulSoup SoupStrainer to keep only the text content (paragraphs and headings) of the pages behind the links from the previous step, and then use the WebBaseLoader document loader to load that content from the top 3 search results.
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Keep only paragraph and heading tags when parsing the pages
bs4_strainer = bs4.SoupStrainer(('p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'))
document_loader = WebBaseLoader(
    web_path=get_links(keyword),
    bs_kwargs={"parse_only": bs4_strainer},
)
docs = document_loader.load()
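To verify that the pages loaded correctly, you can add a small optional check that prints each document's source URL and how much text was extracted:
# Optional check: one Document per link, with the page text in page_content
for doc in docs:
    print(doc.metadata.get("source"), len(doc.page_content))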
3. Document splitting
The next step is to break the documents into smaller chunks, for the reasons we outlined earlier: faster, more efficient searching, and fitting the relevant context into the context window of the LLM (Large Language Model) we are going to use. To do this, we will use the RecursiveCharacterTextSplitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
splits = splitter.split_documents(docs)
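You can peek at the result here too; because we passed add_start_index=True, each chunk's metadata records the character offset at which it starts in its source document:
# Optional check: number of chunks and the metadata of the first one
print(len(splits))
print(splits[0].metadata)  # includes 'source' and 'start_index'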
4. Vector Storage
Here we store the document splits in a Chroma vector store. During search/retrieval, the query is embedded as well and compared to the entries in the vector store for similarity. This comparison is typically done by computing the cosine similarity between the vector embeddings, which is one of the simplest ways of measuring how closely two pieces of information are related.
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
vector_store = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
An embedding is a numerical representation of a piece of information, for example a word or a sentence represented as a vector of numbers.
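To make this concrete, here is a minimal sketch of computing the cosine similarity between two embeddings by hand. It assumes numpy is installed, which is not otherwise required by this project:
import numpy as np
from langchain_openai import OpenAIEmbeddings

emb = OpenAIEmbeddings()
v1, v2 = emb.embed_documents(["vector databases", "semantic search"])

# Cosine similarity: dot product divided by the product of the vector norms
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)  # values closer to 1.0 mean the texts are more similar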
5. Retriever
We can convert any VectorStore into a Retriever with as_retriever(). LangChain provides a Retriever interface which wraps an index that can return relevant Documents given a string query.
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 10})
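You can also try the retriever on its own before wiring up the full chain; a retriever is a Runnable, so invoking it with the keyword returns the matching chunks:
# Optional check: the retriever returns up to k=10 relevant chunks
retrieved_docs = retriever.invoke(keyword)
print(len(retrieved_docs))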
6. Blog generation
After retrieving the relevant documents, we can pass the retrieved documents and the keyword(s) entered by the user in a prompt to ChatOpenAI and generate the blog.
template = """
Given the following information, generate a blog post
Write a full blog post that will rank for the following keywords: {keyword}
Instructions:
The blog should be properly and beautifully formatted using markdown.
The blog title should be SEO optimized.
The blog title should be crafted with the keyword in mind and should be catchy and engaging, but not overly expressive.
Each sub-section should have at least 3 paragraphs.
Each section should have at least three subsections.
Sub-section headings should be clearly marked.
Clearly indicate the title, headings, and sub-headings using markdown.
Each section should cover the specific aspects as outlined.
For each section, generate detailed content that aligns with the provided subtopics. Ensure that the content is informative and covers the key points.
Ensure that the content flows logically from one section to another, maintaining coherence and readability.
Where applicable, include examples, case studies, or insights that can provide a deeper understanding of the topic.
Always include discussions on ethical considerations, especially in sections dealing with data privacy, bias, and responsible use. Only add this where it is applicable.
In the final section, provide a forward-looking perspective on the topic and a conclusion.
Please ensure proper and standard markdown formatting always.
Make the blog post sound as human and as engaging as possible, add real world examples and make it as informative as possible.
You are a professional blog post writer and SEO expert.
Context: {context}
Blog:
"""
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

def format_docs(docs):
    # Flatten the retrieved chunks into a single context string for the prompt
    return "\n\n".join(doc.page_content for doc in docs)

llm = ChatOpenAI()
prompt = PromptTemplate.from_template(template=template)

chain = (
    {"context": retriever | format_docs, "keyword": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

response = chain.invoke(input=keyword)
print(response)
And there you have your blog, generated entirely by AI using a basic RAG (Retrieval Augmented Generation) architecture and grounded in live internet results for extra context.
You can save the response as a markdown file to see the result in all its markdown formatting elegance. Add the following function, which you can call to save the response.
def save_file(content, filename):
    directory = "blogs"
    if not os.path.exists(directory):
        os.makedirs(directory)
    filepath = os.path.join(directory, filename)
    with open(filepath, 'w') as f:
        f.write(content)
    print(f" 🥳 File saved as {filepath}")
Then call the save_file function like this to save the generated response in a directory called blogs inside your project directory.
save_file(content=response, filename=keyword+".md")
To run the script, enter your OpenAI API key and a keyword at the top of your file as shown in step 1, then run the following command in your terminal:
python main.py
And that is it: open the generated markdown file in your blogs directory to preview the blog post in its full markdown formatting.
That is all for today. You can check out the source code for the blog generator I made; I used Streamlit to build the UI, which I think is also an interesting way to give your AI agents a user interface so that people can try them out.
Try it out here: https://ai-blog-post-generator.streamlit.app/
Source code: https://github.com/jordan-jakisa/blog_post_writer
Please follow me for more content like this!
Bye, see you later 🤓