Local Mistral Chatbot with RAG (and FAQ)
Let's implement a locally running chatbot with the Mistral LLM, using RAG to retrieve documents from a locally running vector DB that also contains FAQs.
Here’s a breakdown of the steps and the Python code to achieve this:
Phase 1: Setting Up the Local Environment
- Install Dependencies:
pip install transformers sentencepiece accelerate chromadb langchain huggingface_hub PyPDF2 sentence-transformers
- transformers: For loading and running the Mistral LLM.
- sentencepiece: Required by some tokenizers.
- accelerate: For efficient inference, especially on GPUs.
- chromadb: Our local vector database.
- langchain: A framework for building LLM applications, including RAG.
- huggingface_hub: To download the Mistral model.
- PyPDF2: To load PDF documents (if your documents are in PDF format).
- sentence-transformers: Required by HuggingFaceEmbeddings to generate the vector embeddings.
- Download Mistral LLM:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
if torch.cuda.is_available():
    model = model.to("cuda")
print(f"Mistral model '{model_name}' loaded successfully.")
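If the full-precision 7B model does not fit into your GPU memory, one option is to load it quantized to 4 bits. This is a minimal sketch, assuming the optional bitsandbytes package is installed (it is not part of the dependency list above):

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bit at load time
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place the layers on the available devices
)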
- Prepare Your Documents and FAQs:
  - Documents: Place your documents (e.g., .txt, .pdf) in a local directory.
  - FAQs: Create a list of question-answer pairs. For example:

faqs = [
    {"question": "What is your name?", "answer": "I am a helpful AI assistant powered by the Mistral model."},
    {"question": "How can I contact support?", "answer": "Please email support@example.com or call 1-800-SUPPORT."},
    # Add more FAQs here
]
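If you prefer to keep the FAQs out of the code, a small alternative sketch is to store them in a JSON file and load them at startup (faqs.json is a hypothetical file name, not part of the setup above):

import json

# Hypothetical file faqs.json containing a list of
# {"question": "...", "answer": "..."} objects.
with open("faqs.json", "r", encoding="utf-8") as f:
    faqs = json.load(f)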
Phase 2: Setting Up the Local Vector Database (ChromaDB)
- Load Documents and Create Embeddings:
from langchain.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Load documents from a directory
doc_path = "./local_documents"  # Replace with the path to your documents
loader = DirectoryLoader(doc_path, glob="**/*.pdf", loader_cls=PyPDFLoader)  # Example for PDFs
# If you have mixed formats, you might need multiple loaders and combine the documents.
# For text files: loader = DirectoryLoader(doc_path, glob="**/*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

# Create embeddings using a suitable model (e.g., sentence-transformers)
embeddings_model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)

# Create Chroma vector store from document chunks
persist_directory = "chroma_db"
vectordb = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=persist_directory)
vectordb.persist()
print(f"Loaded {len(documents)} documents and created vector embeddings in '{persist_directory}'.")
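As the comment in the code above notes, a directory with mixed file types needs one loader per type. A minimal sketch, assuming the same ./local_documents path:

# One loader per file type; concatenate the resulting document lists
# before passing them to the text splitter.
pdf_loader = DirectoryLoader(doc_path, glob="**/*.pdf", loader_cls=PyPDFLoader)
txt_loader = DirectoryLoader(doc_path, glob="**/*.txt", loader_cls=TextLoader)
documents = pdf_loader.load() + txt_loader.load()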
- Add FAQ Embeddings to the Vector Database:
# Add the FAQ questions (with their answers as metadata) to the existing ChromaDB.
# add_texts embeds the texts with the store's embedding function and stores them.
faq_texts = [faq["question"] for faq in faqs]
faq_metadatas = [{"source": "faq", "answer": faq["answer"]} for faq in faqs]
vectordb.add_texts(
    texts=faq_texts,
    metadatas=faq_metadatas,
    ids=[f"faq-{i}" for i in range(len(faqs))],
)
vectordb.persist()
print(f"Added {len(faqs)} FAQs to the vector database.")
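As an optional sanity check, you can query just the FAQ entries and inspect the scores. The sketch below uses Langchain's similarity_search_with_relevance_scores, which returns (document, score) pairs with scores normalized so that higher means more similar; the test query is only an example:

# Query the FAQ entries only and print question, score, and stored answer.
results = vectordb.similarity_search_with_relevance_scores(
    "How do I reach support?",  # example test query
    k=1,
    filter={"source": "faq"},
)
for doc, score in results:
    print(f"{score:.2f}  {doc.page_content}  ->  {doc.metadata['answer']}")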
Phase 3: Implementing the RAG Chatbot
import torch
import transformers
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# Assumes `model` and `tokenizer` from Phase 1 are defined earlier in the same script.
# Load the persisted ChromaDB
persist_directory = "chroma_db"
embeddings_model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
# Create a pipeline for the Mistral LLM
pipeline = transformers.pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map="auto",
max_new_tokens=500,
do_sample=True,
temperature=0.7,
top_p=0.95,
top_k=50,
repetition_penalty=1.15,
)
llm = HuggingFacePipeline(pipeline=pipeline)
# Create a Retriever from the VectorDB
retriever = vectordb.as_retriever(search_kwargs={"k": 4}) # Retrieve top 4 relevant chunks
# Create the RetrievalQA chain
rag_chain = RetrievalQA.from_llm(llm=llm, retriever=retriever, return_source_documents=True)
def chatbot(query):
    """Processes the user query and returns an answer using RAG with FAQ prioritization."""
    # First, check for an FAQ match. similarity_search_with_relevance_scores returns
    # (document, score) pairs with scores normalized so that higher means more similar.
    faq_results = vectordb.similarity_search_with_relevance_scores(
        query, k=3, filter={"source": "faq"}
    )  # Search top 3 FAQs
    if faq_results:
        best_doc, best_score = faq_results[0]
        # If a relevant FAQ is found with high similarity, return its answer directly
        if best_score > 0.8 and best_doc.metadata.get("answer"):  # Adjust the similarity score threshold as needed
            print("Answer from FAQ:")
            return best_doc.metadata["answer"]
        print("No highly relevant FAQ found, proceeding with document retrieval.")
    else:
        print("No FAQs found, proceeding with document retrieval.")

    # If no relevant FAQ is found, perform RAG on the documents
    rag_results = rag_chain({"query": query})
    print("Answer from Document Retrieval:")
    print("Source Documents:")
    for doc in rag_results["source_documents"]:
        print(f"- {doc.metadata.get('source')}")  # Adjust based on your metadata
    return rag_results["result"]

# Start the chatbot loop
if __name__ == "__main__":
    print("Local Mistral Chatbot with RAG (and FAQ) is running. Type 'exit' to quit.")
    while True:
        user_query = input("You: ")
        if user_query.lower() == "exit":
            break
        response = chatbot(user_query)
        print(f"Bot: {response}")
Explanation:
- Environment Setup: We install the necessary libraries and load the Mistral LLM using transformers. Ensure you have a GPU set up if you want faster inference.
- Document and FAQ Preparation: You need to organize your local documents and create a list of FAQs with questions and their corresponding answers.
- Vector Database (ChromaDB):
  - We load your documents using Langchain's DirectoryLoader and appropriate loaders for your file types (e.g., PyPDFLoader for PDFs, TextLoader for .txt files).
  - The documents are split into smaller chunks using RecursiveCharacterTextSplitter to fit within the LLM's context window and improve retrieval granularity.
  - HuggingFaceEmbeddings is used to generate vector embeddings for each chunk of your documents. You can choose different embedding models based on your needs; all-mpnet-base-v2 is a good general-purpose option.
  - A local ChromaDB instance is created and the document embeddings are stored in it. The persist_directory allows you to save the database to disk and load it later.
  - We then create embeddings for the FAQ questions and add them to the same (or a separate) collection in ChromaDB, along with metadata indicating they come from the "faq" source and including the answer.
- RAG Implementation:
  - We load the persisted ChromaDB.
  - A HuggingFacePipeline is created to wrap the local Mistral LLM, making it compatible with Langchain. We configure generation parameters like max_new_tokens, temperature, etc.
  - A Retriever is created from the ChromaDB, which will fetch the most relevant document chunks based on the user's query.
  - RetrievalQA.from_llm creates the RAG chain. It takes the LLM and the retriever as input. When a query is passed to this chain, it retrieves the relevant documents and then feeds them, along with the query, to the LLM to generate an answer (a sketch of the same chain built with a custom prompt follows below).
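If you want more control over how the retrieved chunks and the question are presented to Mistral, the chain can also be built with a custom prompt. This is a minimal sketch using RetrievalQA.from_chain_type with the "stuff" chain type; the wording of the template is an assumption, not part of the original setup:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# [INST] ... [/INST] is the Mistral-Instruct chat format; {context} and
# {question} are the variables the "stuff" chain fills in.
template = """[INST] Answer the question using only the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question: {question} [/INST]"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)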
- Chatbot Function with FAQ Prioritization:
  - The chatbot function takes the user's query as input.
  - FAQ Check: It first performs a similarity search on the vector database, specifically filtering for entries from the "faq" source. If a highly similar FAQ question is found (based on a similarity score threshold you can adjust), its pre-defined answer is returned directly. This prioritizes providing quick answers to common questions.
  - Document Retrieval (if no good FAQ match): If no highly relevant FAQ is found, the function proceeds with the regular RAG process on the document embeddings using the rag_chain.
  - The function returns the generated answer and also prints the source documents used for retrieval.
- Chatbot Loop: The if __name__ == "__main__": block sets up a simple interactive loop where the user can type queries and receive responses from the chatbot.
To Run This:
- Create a local_documents folder in the same directory as your Python script and place your documents inside it.
- Save the code as a Python file (e.g., local_chatbot.py).
- Run the script from your terminal: python local_chatbot.py
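After these steps, the project directory should look roughly like this (the document file names are placeholders):

your_project/
├── local_chatbot.py
├── local_documents/
│   ├── manual.pdf      # placeholder name
│   └── notes.txt       # placeholder name
└── chroma_db/          # created when the indexing code runs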
This implementation provides a basic local chatbot using Mistral with RAG and prioritizes answering from a set of FAQs stored in the same local vector database as your documents. You can further customize this by:
- Adjusting the similarity score threshold for FAQ matching.
- Experimenting with different embedding models.
- Tuning the LLM’s generation parameters.
- Adding more sophisticated prompt engineering to guide the LLM’s responses.
- Implementing more advanced document loading and splitting strategies.
- Adding a user interface (e.g., using Gradio or Streamlit) for a more interactive experience; a minimal Gradio sketch follows below.
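For the last point, here is a minimal Gradio sketch; it assumes gradio is installed (pip install gradio) and that the chatbot function from Phase 3 is defined in the same script:

import gradio as gr

# Wrap the existing chatbot() function in a simple web UI
# (use this in place of the console loop's __main__ block).
demo = gr.Interface(
    fn=chatbot,
    inputs=gr.Textbox(label="Your question"),
    outputs=gr.Textbox(label="Answer"),
    title="Local Mistral Chatbot with RAG (and FAQ)",
)

demo.launch()  # serves a local web UI, by default at http://127.0.0.1:7860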