Locally running Mistral Chatbot with RAG

Estimated reading time: 8 minutes

Local Mistral Chatbot with RAG (and FAQ)

Let’s implement a locally running chatbot with Mistral, using RAG to retrieve documents from a locally running vector database (ChromaDB) that also contains FAQs.

Here’s a breakdown of the steps and the code to achieve this:

Phase 1: Setting Up the Local Environment

  1. Install Dependencies:
    pip install transformers sentencepiece accelerate chromadb langchain huggingface_hub PyPDF2
  2. Download Mistral LLM (a lower-memory, quantized loading alternative is sketched right after this list):
    from transformers import AutoTokenizer, AutoModelForCausalLM
     import torch
    
     model_name = "mistralai/Mistral-7B-Instruct-v0.2"
     tokenizer = AutoTokenizer.from_pretrained(model_name)
     model = AutoModelForCausalLM.from_pretrained(model_name)
    
     if torch.cuda.is_available():
      model = model.to("cuda")
    
     print(f"Mistral model '{model_name}' loaded successfully.")
     
  3. Prepare Your Documents and FAQs:
    • Documents: Place your documents (e.g., .txt, .pdf) in a local directory.
    • FAQs: Create a list of question-answer pairs. For example:
      faqs = [
        {"question": "What is your name?", "answer": "I am a helpful AI assistant powered by the Mistral model."},
        {"question": "How can I contact support?", "answer": "Please email support@example.com or call 1-800-SUPPORT."},
        # Add more FAQs here
        ]
        
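If the full-precision 7B model does not fit in your GPU memory, one lower-memory option is 4-bit quantized loading. The following is only a minimal sketch, assuming the extra bitsandbytes package is installed (it is not in the pip command above); everything else about the setup stays the same.

# Optional: 4-bit quantized loading for GPUs with limited VRAM.
# Assumes an extra dependency: pip install bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NF4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # run computations in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                     # let accelerate place layers automatically
)

print(f"Quantized Mistral model '{model_name}' loaded.")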

Phase 2: Setting Up the Local Vector Database (ChromaDB)

  1. Load Documents and Create Embeddings:
    from langchain.document_loaders import DirectoryLoader, PyPDFLoader, TextLoader
     from langchain.text_splitter import RecursiveCharacterTextSplitter
     from langchain.embeddings import HuggingFaceEmbeddings
     from langchain.vectorstores import Chroma
    
     # Load documents from a directory
     doc_path = "./local_documents"  # Replace with the path to your documents
     loader = DirectoryLoader(doc_path, glob="**/*.pdf", loader_cls=PyPDFLoader) # Example for PDFs
     # If you have mixed formats, you might need multiple loaders and combine the documents
     # For text files: loader = DirectoryLoader(doc_path, glob="**/*.txt", loader_cls=TextLoader)
     documents = loader.load()
    
     # Split documents into chunks
     text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
     chunks = text_splitter.split_documents(documents)
    
     # Create embeddings using a suitable model (e.g., sentence-transformers)
     embeddings_model_name = "sentence-transformers/all-mpnet-base-v2"
     embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
    
     # Create Chroma vector store from document chunks
     persist_directory = "chroma_db"
     vectordb = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=persist_directory)
     vectordb.persist()
    
     print(f"Loaded {len(documents)} documents and created vector embeddings in '{persist_directory}'.")
     
  2. Add FAQ Embeddings to the Vector Database (a quick retrieval sanity check is sketched right after this list):
    faq_texts = [faq["question"] for faq in faqs]
     faq_metadatas = [{"source": "faq", "answer": faq["answer"]} for faq in faqs]
    
     # Add the FAQ questions to the same Chroma collection; add_texts() embeds
     # them internally with the store's embedding function, so no manual
     # embedding step is needed here.
     vectordb.add_texts(
      texts=faq_texts,
      metadatas=faq_metadatas,
      ids=[f"faq-{i}" for i in range(len(faqs))]
     )
    
     vectordb.persist()
     print(f"Added {len(faqs)} FAQs to the vector database.")
     
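With the documents and FAQs indexed, a quick similarity search against the store is a useful sanity check before wiring up the LLM. This is a small sketch that reuses the vectordb object created above; the test query is purely illustrative.

# Quick sanity check: query the freshly built vector store directly.
test_query = "How can I contact support?"  # illustrative query

# Top matches across both documents and FAQs
for doc in vectordb.similarity_search(test_query, k=3):
    source = doc.metadata.get("source", "unknown")
    print(f"[{source}] {doc.page_content[:80]}")

# Restrict the search to FAQ entries only
faq_hits = vectordb.similarity_search(test_query, k=1, filter={"source": "faq"})
if faq_hits:
    print("Best FAQ answer:", faq_hits[0].metadata.get("answer"))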

Phase 3: Implementing the RAG Chatbot

import transformers
 from langchain.llms import HuggingFacePipeline
 from langchain.chains import RetrievalQA

 # Load the persisted ChromaDB
 persist_directory = "chroma_db"
 embeddings_model_name = "sentence-transformers/all-mpnet-base-v2"
 embeddings = HuggingFaceEmbeddings(model_name=embeddings_model_name)
 vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)

 # Create a pipeline for the Mistral LLM
 pipeline = transformers.pipeline(
  "text-generation",
  model=model,
  tokenizer=tokenizer,
  torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
  device_map="auto",
  max_new_tokens=500,
  do_sample=True,
  temperature=0.7,
  top_p=0.95,
  top_k=50,
  repetition_penalty=1.15,
 )

 llm = HuggingFacePipeline(pipeline=pipeline)

 # Create a Retriever from the VectorDB
 retriever = vectordb.as_retriever(search_kwargs={"k": 4}) # Retrieve top 4 relevant chunks

 # Create the RetrievalQA chain
 rag_chain = RetrievalQA.from_llm(llm=llm, retriever=retriever, return_source_documents=True)

 def chatbot(query):
     """Processes the user query and returns an answer using RAG with FAQ prioritization."""

     # First, check for an FAQ match. similarity_search_with_relevance_scores returns
     # (document, score) pairs with scores normalized so that higher means more similar.
     faq_results = vectordb.similarity_search_with_relevance_scores(
         query, k=3, filter={"source": "faq"}
     )
     if faq_results and faq_results[0][0].metadata.get("answer"):
         top_doc, top_score = faq_results[0]
         if top_score > 0.8:  # Adjust the similarity score threshold as needed
             # A highly similar FAQ was found; return its stored answer directly
             print("Answer from FAQ:")
             return top_doc.metadata["answer"]
         else:
             print("No highly relevant FAQ found, proceeding with document retrieval.")
     else:
         print("No FAQs found, proceeding with document retrieval.")

     # If no relevant FAQ is found, perform RAG on the documents
     rag_results = rag_chain({"query": query})
     print("Answer from Document Retrieval:")
     print("Source Documents:")
     for doc in rag_results["source_documents"]:
         print(f"- {doc.metadata.get('source')}")  # Adjust based on your metadata

     return rag_results["result"]

 # Start the chatbot loop
 if __name__ == "__main__":
     print("Local Mistral Chatbot with RAG (and FAQ) is running. Type 'exit' to quit.")
     while True:
         user_query = input("You: ")
         if user_query.lower() == "exit":
             break
         response = chatbot(user_query)
         print(f"Bot: {response}")
 

Explanation:

  1. Environment Setup: We install the necessary libraries and load the Mistral LLM using transformers. Ensure you have a CUDA-capable GPU set up if you want faster inference.
  2. Document and FAQ Preparation: You need to organize your local documents and create a list of FAQs with questions and their corresponding answers.
  3. Vector Database (ChromaDB):
    • We load your documents using Langchain’s DirectoryLoader and appropriate loaders for your file types (e.g., PyPDFLoader for PDFs, TextLoader for .txt files).
    • The documents are split into smaller chunks using RecursiveCharacterTextSplitter to fit within the LLM’s context window and improve retrieval granularity.
    • HuggingFaceEmbeddings is used to generate vector embeddings for each chunk of your documents. You can choose different embedding models based on your needs. all-mpnet-base-v2 is a good general-purpose option.
    • A local ChromaDB instance is created and the document embeddings are stored in it. The persist_directory allows you to save the database to disk and load it later.
    • We then create embeddings for the FAQ questions and add them to the same (or a separate) collection in ChromaDB, along with metadata indicating they are from the “faq” source and including the answer.
  4. RAG Implementation:
    • We load the persisted ChromaDB.
    • A HuggingFacePipeline is created to wrap the local Mistral LLM, making it compatible with Langchain. We configure generation parameters like max_new_tokens, temperature, etc.
    • A Retriever is created from the ChromaDB, which will fetch the most relevant document chunks based on the user’s query.
    • RetrievalQA.from_llm creates the RAG chain. It takes the LLM and the retriever as input. When a query is passed to this chain, it retrieves relevant documents and then feeds them along with the query to the LLM to generate an answer.
  5. Chatbot Function with FAQ Prioritization:
    • The chatbot function takes the user’s query as input.
    • FAQ Check: It first performs a similarity search on the vector database, specifically filtering for entries from the “faq” source. If a highly similar FAQ question is found (based on a similarity score threshold you can adjust), it returns the pre-defined answer directly. This prioritizes providing quick answers to common questions.
    • Document Retrieval (if no good FAQ match): If no highly relevant FAQ is found, the function proceeds with the regular RAG process on the document embeddings using the rag_chain.
    • The function returns the generated answer and also prints the source documents used for retrieval.
  6. Chatbot Loop: The if __name__ == "__main__": block sets up a simple interactive loop where the user can type queries and receive responses from the chatbot.

To Run This:

  1. Create a local_documents folder in the same directory as your Python script and place your documents inside it.
  2. Save the code as a Python file (e.g., local_chatbot.py).
  3. Run the script from your terminal: python local_chatbot.py

This implementation provides a basic local chatbot using Mistral with RAG and prioritizes answering from a set of FAQs stored in the same local vector database as your documents. You can further customize this by:

  • Adjusting the similarity score threshold for FAQ matching.
  • Experimenting with different embedding models.
  • Tuning the LLM’s generation parameters.
  • Adding more sophisticated prompt engineering to guide the LLM’s responses (see the custom-prompt sketch after this list).
  • Implementing more advanced document loading and splitting strategies.
  • Adding a user interface (e.g., using Gradio or Streamlit) for a more interactive experience (see the Gradio sketch after this list).
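For the prompt-engineering point, one option is to pass a custom prompt into the RetrievalQA chain. This is a minimal sketch, assuming the llm and retriever objects from Phase 3; it uses RetrievalQA.from_chain_type with the "stuff" chain type, and the instruction wording is only an example.

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Example instruction wording in Mistral's [INST] format; adjust to your needs.
template = """[INST] You are a helpful assistant. Answer using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question} [/INST]"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

# Reuses the llm and retriever objects created in Phase 3.
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",                    # stuff all retrieved chunks into one prompt
    chain_type_kwargs={"prompt": prompt},  # inject the custom prompt
    return_source_documents=True,
)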

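For the user-interface point, a minimal Gradio wrapper around the chatbot function might look like the sketch below, assuming gradio is installed (pip install gradio, also not in the dependency list above).

import gradio as gr

# Wraps the chatbot() function from Phase 3 in a simple local web UI.
demo = gr.Interface(
    fn=chatbot,
    inputs=gr.Textbox(label="Your question"),
    outputs=gr.Textbox(label="Answer"),
    title="Local Mistral Chatbot with RAG (and FAQ)",
)

if __name__ == "__main__":
    demo.launch()  # serves the UI locally (default: http://127.0.0.1:7860)

Gradio calls chatbot(query) on every submission, so the FAQ prioritization and RAG fallback behave exactly as in the terminal loop.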