Build a Document-Powered Chatbot with LangChain, Amazon Bedrock, and RAG


In this blog, I will demonstrate how to use Amazon Bedrock, LangChain LCEL, and the Retrieval Augmented Generation (RAG) framework to build a bare-bones chatbot that sources data from webpages.  We will be sourcing the web page data from Ippon Technologies, a specialized tech consulting company.  This process can extend to using other static documents like CSVs, PDFs, JSON, etc., as data sources for the chatbot.  You can learn more about document loading in LangChain's document loader documentation.

Before we get into the code, let’s go over some basics.

What is Generative AI?  Generative AI is AI technology that generates content.  That content can be anything from text, code, and answers to questions to videos, songs, and images.  The most popular generative AI tech right now is probably ChatGPT.  ChatGPT, Bard, and other “chat” AI technologies fall under a subset of generative AI called Large Language Models (LLMs).

What is an LLM?  LLMs, or “Large Language Models”, are language models built using machine learning, NLP, and other statistical methods.  They infer relationships between words, phrases, and sentences by training on large sets of text documents.  They are good at capturing semantics, syntax, and language patterns.  LLMs are capable of summarizing text, generating text, answering questions, generating code, and more.

What is LangChain?  LangChain is an open-source framework that simplifies the development of applications that use LLMs.  LangChain is available for Python and for JavaScript/TypeScript.  Its library includes tools that let developers connect powerful LLMs, such as OpenAI's GPT-3.5 and GPT-4, to an array of external data sources to build natural language processing (NLP) applications.

What is LCEL?  LCEL (LangChain Expression Language) is a way to easily connect chains.  It utilizes LangChain’s “Runnable…” methods to control the input and output variables of each chain so that the pipeline flows smoothly.
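To make the piping idea concrete, here is a minimal plain-Python sketch of how a `|` operator can chain steps together.  The `Step` class and the three example steps are hypothetical stand-ins made up for illustration; they are not LangChain's Runnable classes, which do much more (batching, streaming, async).

```python
class Step:
    """Toy stand-in for a LangChain Runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new Step that runs a, then feeds its output to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Three hypothetical steps: build a prompt, fake a model call, clean the output
make_prompt = Step(lambda question: f"Answer briefly: {question}")
fake_llm = Step(lambda prompt: f"  [model output for '{prompt}']  ")
to_text = Step(lambda raw: raw.strip())

chain = make_prompt | fake_llm | to_text
result = chain.invoke("What is LCEL?")
print(result)
```

Each step only needs to agree with its neighbors on what it receives and returns, which is the property LCEL relies on when it controls the input and output variables of each chain.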

What is Amazon Bedrock?  Amazon Bedrock is an AWS service that makes many foundation models available through a unified API.  Which model to pick depends on your use case.  For example, Amazon Titan is a family of models built by Amazon and pre-trained on large datasets, which makes them powerful, general-purpose models.  Jurassic-2, from AI21 Labs, is a family of multilingual large language models for text generation in Spanish, French, German, Portuguese, Italian, and Dutch.  Claude 2, from Anthropic, is built for thoughtful dialogue, content creation, complex reasoning, creativity, and coding, and is trained with Constitutional AI and harmlessness training.

What is Retrieval Augmented Generation (RAG)?  Retrieval Augmented Generation or “RAG” is a method of enriching LLMs with data.  It is used to create AI assistants that are capable of having discussions grounded in specialized enterprise knowledge.  This is accomplished by connecting these powerful but generic LLMs to internal knowledge bases of documents and by doing so, producing assistants that are domain-specific and more trustworthy.  The LangChain components that we’ll use for RAG are: 

  • document loader - LangChain libraries that help with loading files like CSV, HTML, JSON, etc.
  • embedding model - Embedding is the process by which text, images, and audio are given a numerical representation in a vector space.  It is usually performed by a machine learning model.  The resulting vectors (embeddings) are designed to be consumed by machine learning models and semantic search algorithms: items with similar meaning end up close to each other in the vector space.
  • vector store/database - A vector store is a specialized database that compactly stores a large number of high-dimensional vectors representing words and entities and provides fast similarity search across them, even at the scale of billions of vectors.  The search is typically a k-nearest neighbors (k-NN) lookup, often approximate, that ranks vectors by a measure such as cosine similarity.
  • retriever - LangChain’s module for retrieving the stored documents that are most relevant to a query.
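The embedding, vector store, and retriever pieces can be illustrated with a small plain-Python sketch.  It assumes hand-made three-dimensional vectors in place of real embeddings (which have hundreds or thousands of dimensions) and a brute-force search in place of a real vector store; all names and values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Brute-force k-nearest-neighbor search over a {text: vector} store."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made 3-dimensional "embeddings" (hypothetical values for illustration)
vector_store = {
    "Ippon is a tech consulting firm": [0.9, 0.1, 0.0],
    "Ippon is a winning score in judo": [0.1, 0.9, 0.0],
    "Cloud migration success story": [0.8, 0.0, 0.3],
}

# Pretend embedding of the query "Which company does cloud consulting?"
query = [1.0, 0.0, 0.1]
matches = top_k(query, vector_store, k=2)
print(matches)
```

The two consulting-related texts rank above the judo one because their vectors point in nearly the same direction as the query vector.  A real vector store does the same ranking with indexing structures that avoid comparing the query against every stored vector.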

Let's jump into the code!

First, we initialize the boto3 client.  Take note of the service we are using, ‘bedrock-runtime’.  boto3 has other services with ‘bedrock’ in their names, but ‘bedrock-runtime’ is the one that lets us invoke the foundation models.
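A minimal sketch of this step is shown here.  The region name and the helper function are assumptions for illustration; use the region where you have been granted model access, and note that AWS credentials must already be configured.

```python
SERVICE_NAME = "bedrock-runtime"


def make_bedrock_runtime_client(region_name="us-east-1"):
    """Create a boto3 client for invoking Bedrock foundation models.

    The default region is an assumption; pick one where Bedrock is enabled
    for your account.
    """
    import boto3  # imported lazily so the sketch loads without boto3 installed

    # "bedrock-runtime" is the inference service; plain "bedrock" is the
    # control-plane service used for things like listing available models.
    return boto3.client(service_name=SERVICE_NAME, region_name=region_name)
```

Calling `make_bedrock_runtime_client()` returns the client object that later steps pass to the model.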

We will use Anthropic’s Claude family of LLMs, which have been designed to generate thoughtful dialogue.  Each foundation model takes inference parameters; most are similar across models, but the exact names vary depending on which model you use.  For Claude, they include:

  • max_tokens_to_sample: Maximum number of tokens to generate in the response.  A token is a piece of a word, not a whole word.
  • stop_sequences: Character sequences that tell the model to stop generating as soon as it produces one of them.
  • temperature: Use a lower value to decrease randomness in the response.
  • top_p: Use a lower value to ignore less probable options.
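As a sketch of how these parameters might be passed, the snippet below builds a request body in Claude's text-completion format and sends it with boto3's `invoke_model`.  It calls boto3 directly rather than going through a LangChain wrapper, and the model ID, the parameter values, and the helper names are assumptions for illustration.

```python
import json

# Assumed model ID; use whichever Claude model your account can access
MODEL_ID = "anthropic.claude-v2"


def build_claude_body(question, max_tokens=512, temperature=0.1, top_p=0.9):
    """Build the JSON request body for Claude's text-completion format."""
    return json.dumps({
        # The completion format expects Human/Assistant turn markers
        "prompt": f"\n\nHuman: {question}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,  # cap on generated tokens
        "stop_sequences": ["\n\nHuman:"],    # stop before a new human turn
        "temperature": temperature,          # lower = less random
        "top_p": top_p,                      # lower = drop unlikely tokens
    })


def ask_claude(client, question):
    """Send the question through a bedrock-runtime client and return the text."""
    response = client.invoke_model(modelId=MODEL_ID, body=build_claude_body(question))
    return json.loads(response["body"].read())["completion"]


body = json.loads(build_claude_body("Who is ippon?"))
print(body["max_tokens_to_sample"], body["temperature"], body["top_p"])
```

With a `bedrock-runtime` client in hand, `ask_claude(client, "Who is ippon?")` would return the model's completion as a string.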


We can now invoke this LLM with a question or prompt and see how it responds.  Since it is a general-purpose model that has been trained on a wide range of documents, it may not have specific information.  It will try its best to “understand” the question/prompt and provide the answer that is most related to the question, given the data that it has been trained on.  Let’s test this by asking the LLM “Who is ippon?”.

We can see the response that the LLM has generated below.  It states that it is unsure of who or what “Ippon” refers to, but it does provide us with a definition of “Ippon”, although not the one we are interested in.  Depending on which model you use, you will receive different responses.  When I tried this with Amazon’s Titan family of models, which is more general purpose, it produced a “hallucination”.  “Hallucination” is a phenomenon that occurs with LLMs and refers to the generation of text that is inaccurate, nonsensical, or detached from the facts.

>>> print(response)
 Ippon is a Japanese martial arts term that refers to a perfect score or winning technique in judo or other grappling arts like jujitsu. Some key things to know about ippon:
- It is the highest score that can be awarded in a judo match. Scoring ippon results in an automatic win for that match.
- To score ippon, a throw must be completed with speed, power, and perfect technique. The opponent must be thrown flat on their back with control.
- Some throws that are more likely to score ippon include seoi-nage (shoulder throw), tomoe-nage (circle throw), and sumi-gaeshi (corner reversal). These throws demonstrate all the elements needed for an ippon score.
- Ippon can also be scored via pinning the opponent for 20 seconds or making them submit with a joint lock like an armbar. But throws are considered the "cleanest" way to win ippon.
- In competition, ippon is signaled by the referee raising one arm straight up in the air. It's the highest honor a judoka can achieve in a match.
So in summary, ippon refers to a perfect scoring technique, most commonly a well-executed throwing technique, in Japanese grappling arts like judo. It results in an immediate win in competition.

We need to give the LLM some context - some information it can “learn” so we can ask it specific questions.

For this example, we will load some web pages as documents to use as a knowledge base.  We will ingest some pages that describe who Ippon Technologies is and some of their success stories.  We will be using the BeautifulSoup4 package to crawl through their success stories page and get the links that we are interested in.
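The link-gathering step can be sketched as follows.  This version uses Python's built-in `html.parser` as a stand-in for BeautifulSoup4 so that it runs without extra packages, and it parses a hard-coded HTML snippet instead of fetching a live page; the base URL, the paths, and the `success-stories` keyword are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href> tags whose href contains a keyword."""

    def __init__(self, base_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.keyword in href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip duplicates, keep page order
                self.links.append(url)


# Hypothetical page snippet standing in for the fetched HTML
sample_html = """
<a href="/success-stories/data-platform">Data platform</a>
<a href="/careers">Careers</a>
<a href="/success-stories/cloud-migration">Cloud migration</a>
<a href="/success-stories/data-platform">Data platform (again)</a>
"""

collector = LinkCollector("https://example.com", "success-stories")
collector.feed(sample_html)
links = collector.links
print(links)
```

With BeautifulSoup4, the same filtering would typically start from `soup.find_all("a", href=True)` and apply the keyword check to each tag's `href`.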

Now that we have the links, we will use LangChain’s WebBaseLoader to read and load these HTML pages.

Next, we need to index the data into a vectorstore.  This requires an embedding model and a vectorstore.  We will use LangChain’s BedrockEmbeddings module for the embedding model, split the documents into chunks, store them in the FAISS vectorstore, and create a retriever object.
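To show what splitting documents into chunks means, here is a simplified character-based splitter with overlap.  It is a stand-in written for illustration, not LangChain's implementation: LangChain's text splitters try to break on natural boundaries such as paragraphs and sentences, and the sizes used here are arbitrary.

```python
def split_text(text, chunk_size=40, chunk_overlap=10):
    """Split text into fixed-size character chunks that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


text = "0123456789" * 3  # 30 characters
chunks = split_text(text, chunk_size=12, chunk_overlap=4)
print(chunks)
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.  In LangChain, the resulting chunks would then typically be indexed with something like `FAISS.from_documents(chunks, embeddings)` and exposed through `.as_retriever()`.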

We will create our first chain to query the retriever.  To query the retriever, we need to create a prompt that directs the LLM on what to do.

From the definition of “chain,” you can see that each “link” of the chain is connected by a “|”.  The output of each link becomes the input of the next, whether that means creating and passing variables or modifying outputs.  The first “link” in this chain is enclosed in curly brackets.  A pair of curly brackets (a Python dictionary) within a chain is equivalent to LangChain’s “RunnableParallel”, a class that invokes its components concurrently on the same input.  In this case, we are defining two variables, “context” and “question”.  These variables are passed to the “prompt” link, the formatted prompt is passed to the “llm” link, and, finally, “StrOutputParser” formats the output as a string.
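A chain of the shape described above might look like the following sketch.  The prompt wording and the `build_rag_chain` helper are assumptions for illustration; `retriever` and `llm` are whatever retriever and model objects you created earlier.

```python
# Assumed prompt wording; adjust the instructions to suit your use case
PROMPT_TEMPLATE = """Answer the question based only on the following context:
{context}

Question: {question}
"""


def build_rag_chain(retriever, llm):
    """Wire retriever, prompt, model, and output parser into one LCEL chain."""
    # Lazy imports so the sketch loads even without LangChain installed
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough

    prompt = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    return (
        # The dict is coerced to a RunnableParallel: the retriever fills
        # "context" while the raw input is passed through as "question".
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
```

Usage would then be along the lines of `build_rag_chain(retriever, llm).invoke("Who is Ippon?")`, which returns the answer as a plain string.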

Now, let’s invoke this chain with a question and see if it uses the documents we’ve loaded.

We can see the AI-generated response below and it actually returned what looks like a correct summary of the documents we loaded.  Just what we wanted.

>>> print(result)

 Based on the context provided, Ippon is a consulting and expertise firm that helps clients leverage their digital assets to design strategies and deploy transformations at scale. Some key details:

- Ippon is built up by more than 500 tech enthusiasts and supports more than 100 customers per year. 

- It has offices in 4 continents and over 500 collaborators. 

- Ippon helps clients with digital transformation, innovation challenges, product development, and scaling technologies. 

- It works in areas like strategy, technology, and transformation to help clients accelerate growth.

So in summary, Ippon appears to be a global consulting firm that partners with organizations to help them transform digitally and scale their technologies.

Say we wanted to ask the LLM to tell us more about the “third” point it listed.  Will it know which of the points that it generated is the “third” one?

From the response below, we can see that it generated an answer.  We can tell that it understood part of the question because it is giving us information about a “third point”, but not the entire question because the response is for a different topic entirely.

>>> print(result)

 Based on the context provided, the third point listed in the passage about the Legacy Platform Build v. Buy Assessment success story is:

< 3 months 

to learn, analyze and craft 3-5y roadmap

This point indicates that Ippon was able to learn about the client's business, analyze their needs and options, and craft a 3-5 year roadmap for their platform modernization efforts, all within less than 3 months. No other details are provided about this third point.

What happened?  The LLM has information from the webpages that we embedded, but it doesn’t have any context from our chat conversation.  So, when we ask it about anything that requires knowledge or context from our current chat, it won’t know the answer and may produce a hallucination.  (With prompt engineering, you can tell the LLM to return a specific string like “I do not know the answer” under conditions like these.)

This is where chat memory comes into play.  We can add a memory component to our LLM object.  The memory component will store the chat history and use it as part of the context for the next question/prompt.  The memory module that we will use from LangChain is ConversationBufferMemory, which will store the chat history in memory and use it as a variable.  We will also write a prompt so the LLM has some direction and instructions when invoking.  The prompt will take in the chat history and the question that is asked and form a new question that one can understand without the chat history, a “standalone question”.  We will use the memory object in the “loaded_memory” chain, which will become a link in a larger chain.
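The idea behind a buffer memory can be shown with a toy class.  This is a plain-Python stand-in, not LangChain's ConversationBufferMemory; the method names echo LangChain's, but the signatures are simplified and the class itself is hypothetical.

```python
class BufferMemory:
    """Toy stand-in for a conversation buffer: keeps every turn in a list."""

    def __init__(self):
        self.messages = []  # (role, text) tuples in chat order

    def save_context(self, question, answer):
        self.messages.append(("Human", question))
        self.messages.append(("AI", answer))

    def load_memory_variables(self):
        # Returned as a dict so it can be merged into a chain's input variables
        return {"chat_history": list(self.messages)}


memory = BufferMemory()
memory.save_context("Who is Ippon?", "Ippon is a consulting firm.")

# The next question travels together with everything said so far
chain_input = {"question": "Tell me more about the third point."}
chain_input.update(memory.load_memory_variables())
print(chain_input)
```

Because the history is just another input variable, any later step in the chain can read it the same way it reads the question.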

Next, we will create the “standalone_question” chain.  We will create a variable called “standalone_question” by defining the variables for “question” and “chat_history” that will be passed to the “CONDENSE_QUESTION_PROMPT”, which will then invoke the “llm”, and output a string, which should be a standalone question.
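What the condense prompt feeds to the LLM can be sketched in plain Python.  The template wording below is an assumption, as are the helper names; the point is only to show how the chat history and the follow-up question are combined into a single prompt that asks for a standalone question.

```python
# Assumed wording for the condense prompt
CONDENSE_QUESTION_TEMPLATE = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""


def format_chat_history(messages):
    """Render (role, text) tuples as one 'Role: text' line per message."""
    return "\n".join(f"{role}: {text}" for role, text in messages)


def build_condense_prompt(question, messages):
    """Fill the template with the formatted history and the new question."""
    return CONDENSE_QUESTION_TEMPLATE.format(
        chat_history=format_chat_history(messages),
        question=question,
    )


history = [
    ("Human", "Who is Ippon?"),
    ("AI", "Ippon is a consulting firm. Third point: it helps with transformation."),
]
prompt_text = build_condense_prompt("Tell me more about the third point.", history)
print(prompt_text)
```

Given a prompt like this, the LLM has what it needs to rewrite “the third point” into a question that names the actual topic.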

The next chain we will create is the “retrieved_documents” chain, which will use the standalone question and the retriever to pull documents that are relevant to the standalone question.

For our next chain, we will be using a custom function to combine the documents into a single string.  We will be producing two variables, “context” and “question”, that will be used in the next chain.
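A document-combining function of this kind can be as small as the sketch below.  The `Doc` class is a minimal stand-in for a retrieved document and the function name and separator are assumptions; only the text of each document is used.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (only the text is used)."""
    page_content: str


def combine_documents(docs, separator="\n\n"):
    """Join the text of all retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)


retrieved = [
    Doc("Ippon has over 500 collaborators."),
    Doc("Ippon partners with AWS."),
]

# The two variables handed to the next chain
final_inputs = {
    "context": combine_documents(retrieved),
    "question": "Who is Ippon?",
}
print(final_inputs["context"])
```

A blank line between documents keeps their boundaries visible to the LLM without adding any extra markup.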

The next piece we will create will be the “answer” chain, which will utilize the retrieved documents to answer the question.  We will create a prompt that tells the LLM to only use the context (retrieved documents) to answer the question.

Now we can link all the chains we created into a final chain.

For the memory to work, we have to code it to save our messages.  Let’s create a function that handles the invoking of the bot and storing the messages.
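One way to structure such a function is sketched below.  The `make_tell_me` helper and the fake chain are hypothetical: the fake chain simply reports how much history it received, so the example runs without a model and makes the bookkeeping visible.

```python
def make_tell_me(chain, history):
    """Return a tell_me(question) function that records each exchange."""

    def tell_me(question):
        # Pass the history so far, then store the new turn for the next call
        answer = chain({"question": question, "chat_history": list(history)})
        history.append(("Human", question))
        history.append(("AI", answer))
        return answer

    return tell_me


def fake_chain(inputs):
    """Hypothetical stand-in chain: reports how many prior messages it saw."""
    count = len(inputs["chat_history"])
    return f"answer to '{inputs['question']}' using {count} prior messages"


history = []
tell_me = make_tell_me(fake_chain, history)
first = tell_me("Who is Ippon?")
second = tell_me("Tell me more about the third point.")
print(first)
print(second)
```

The second call sees the two messages saved by the first, which is exactly what lets a follow-up question refer back to an earlier answer.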

Let’s test the knowledge base to make sure it’s using the embedded documents.  Let’s ask it “Who is Ippon?”


>>> tell_me(q)
 Based on the provided context, Ippon is a consulting and expertise firm that supports the digital transformation of companies. Some key details:
- Ippon has over 500 collaborators and operates across 4 continents. 
- It helps organizations design digital strategies and deploy roadmaps at scale to quickly deliver value. 
- It provides expertise and capabilities related to strategy, technology, and transformation, focusing on areas like enterprise architecture, product strategy, cloud, data platforms, DevSecOps, agility, etc. 
- It partners with technology companies like AWS, Snowflake, and Databricks to help clients leverage their platforms.
- Ippon's goal is to accelerate clients' digital transformation and help implement the latest innovation and modernization methodologies. 
- It has experience delivering projects for clients across industries like payments, finance, utilities, and more.
So in summary, Ippon is an international consulting and expertise firm that supports organizations' digital transformations through strategic consulting, technology implementation, and transformation services.

The response looks good!  Now let’s test the chat memory.  We’ll ask the bot a question that refers to our conversation.  Let’s ask it to tell us more about the “third” point.  If it works, it should tell us about Ippon’s “expertise and capabilities related to strategy, technology, and transformation…”.


Based on the context provided, here are some additional details about Ippon's capabilities related to transformation:

- DevSecOps: Ippon helps organizations implement DevSecOps practices and principles to accelerate delivery of secure software through automation, integration, and collaboration between development, security, and operations teams. 

- Agility: Ippon coaches organizations on adopting agile methodologies like Scrum and Kanban to improve flexibility, continuous improvement, and ability to manage changing priorities. 

- Craftsmanship: Ippon promotes software craftsmanship principles like simplicity, transparency, and technical excellence to build high-quality solutions that are a pleasure to work with and easy to maintain over time.

- Modernization: Ippon assists with modernizing legacy systems and technical debt by re-platforming to cloud-native architectures, refactoring code, and re-engineering for new business needs.

- Innovation: Ippon helps unlock innovation through design thinking, rapid prototyping, and experimentation to identify new products/services and business models.

- Industrialization: Ippon supports industrializing new practices, tools, and workflows to scale transformations across the organization through change management and capability building.

From the response, we can see that the bot understood the context of the “third” point.  It now has the loaded documents to use as context and has the chat history, so it is working as expected!

In this blog, I demonstrated how to build a document-sourced chatbot.  I can see this being utilized as an internal AI assistant that can answer company-related questions for employees, a customer service chatbot that can answer frequently asked questions, or any other type of assistant that is full of information.  That’s just off the top of my head.  I believe there are use cases that we haven’t even thought about yet.  There are also many other things that I’d like to dive deeper into that may improve the bot such as model parameter tuning, document chunking, prompt engineering, and embedding models.

Check out these related blogs for more information:
Querying Snowflake With LangChain and AWS Bedrock

Supercharge LLMs With Your Own Data With AWS Bedrock Knowledge Base



Post by Will Han
March 14, 2024
Will is a data engineer at Ippon Technologies.