This article provides a comprehensive guide to implementing a backend-only Retrieval-Augmented Generation (RAG) system enhanced with Multi-Hop Retrieval capabilities. This advanced technique, leveraging LangChain’s SelfQueryRetriever, OpenAI’s language models and embeddings, and ChromaDB for vector storage, enables more sophisticated question answering over a knowledge base.
Understanding Multi-Hop Retrieval in RAG
Traditional RAG systems retrieve relevant documents based on the similarity of the query to the document content. However, answering complex questions often requires synthesizing information from multiple documents that are not directly semantically related to the query but are related to each other. Multi-Hop Retrieval addresses this by allowing the system to perform a sequence of retrieval steps, “hopping” from one relevant piece of information to another to gather the necessary context for answering the question.
For instance, to answer “Which movie directed by someone who won an Oscar for Best Director stars an actor who has also won an Oscar for Best Actor?”, the system needs to:
- Find directors who have won an Oscar for Best Director.
- Retrieve movies directed by those individuals.
- Identify actors in those movies.
- Check if those actors have won an Oscar for Best Actor.
- Finally, identify the movie that satisfies all these conditions.
This multi-step reasoning process is facilitated by the SelfQueryRetriever, which can understand the underlying entities and relationships within a query and translate them into a series of retrieval operations based on document metadata.
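To make the hops concrete before diving into the implementation, here is a purely illustrative sketch of how the Oscar question could be answered by a sequence of manual, metadata-filtered lookups against a Chroma vector store. The field names (director, actor, award_category, title) anticipate the metadata schema defined later in this article, and this manual loop is only a mental model; the actual system below lets the SelfQueryRetriever construct the filters automatically.

# Illustrative only: manual "hops" over a Chroma vector store using metadata filters.
def find_oscar_winning_movie(vectordb):
    # Hop 1: find documents about Best Director winners.
    director_docs = vectordb.similarity_search(
        "Academy Award for Best Director winner",
        k=5,
        filter={"award_category": "Best Director"},
    )
    directors = {d.metadata.get("director") for d in director_docs if d.metadata.get("director")}

    # Hop 2: retrieve movies directed by those people.
    movie_docs = []
    for director in directors:
        movie_docs += vectordb.similarity_search("movie", k=5, filter={"director": director})

    # Hop 3: check whether any of those movies stars a Best Actor winner.
    for movie in movie_docs:
        actor = movie.metadata.get("actor")
        if not actor:
            continue
        actor_docs = vectordb.similarity_search(
            "Academy Award for Best Actor winner", k=3, filter={"actor": actor}
        )
        if any(d.metadata.get("award_category") == "Best Actor" for d in actor_docs):
            return movie.metadata.get("title")
    return None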
Core Components and Implementation
1. Knowledge Ingestion: Preparing the Data
The first crucial step is to ingest and process the knowledge base. This involves loading the documents, splitting them into manageable chunks, generating embeddings, and crucially, extracting and enriching metadata.
import os
from dotenv import load_dotenv
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
# Define paths
KNOWLEDGE_BASE_PATH = "knowledge_base"
DB_PATH = "chroma_db"
# Load documents from a directory
def load_documents(path):
    loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader)
    documents = loader.load()
    return documents
# Split documents into smaller chunks
def chunk_documents(documents, chunk_size=1000, chunk_overlap=100):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    chunks = text_splitter.split_documents(documents)
    return chunks
# Extract and enrich metadata from document content
def enrich_metadata(chunk):
    metadata = chunk.metadata
    content = chunk.page_content
    lowered = content.lower()
    # Example metadata extraction based on keywords. Matching is done on the
    # lowercased text, but values are taken from the original text so that
    # capitalization (e.g. of names) is preserved in the metadata.
    if "title:" in lowered:
        start = lowered.index("title:") + len("title:")
        metadata["title"] = content[start:].split("\n")[0].strip()
    if "director:" in lowered:
        start = lowered.index("director:") + len("director:")
        metadata["director"] = content[start:].split("\n")[0].strip()
    # ... (more metadata extraction logic for actor, genre, award, year, etc.) ...
    return metadata
# Create a Chroma vector database from the chunks and metadata
def create_vectorstore(chunks, embeddings, db_path):
    enriched_chunks = []
    for chunk in chunks:
        metadata = enrich_metadata(chunk)
        enriched_chunks.append({"page_content": chunk.page_content, "metadata": metadata})
    vectordb = Chroma.from_texts(
        texts=[c["page_content"] for c in enriched_chunks],
        metadatas=[c["metadata"] for c in enriched_chunks],
        embedding=embeddings,
        persist_directory=db_path
    )
    vectordb.persist()
    return vectordb
if __name__ == "__main__":
    documents = load_documents(KNOWLEDGE_BASE_PATH)
    chunks = chunk_documents(documents)
    vectordb = create_vectorstore(chunks, embeddings, DB_PATH)
    print(f"Vector database created and persisted at {DB_PATH}")
The enrich_metadata function is critical. It parses the content of each document chunk and extracts relevant information such as titles, directors, actors, genres, awards, and years. This metadata is then associated with the corresponding vector embeddings in ChromaDB. The effectiveness of multi-hop retrieval relies heavily on the accuracy and comprehensiveness of this extracted metadata.
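For reference, a short hypothetical knowledge-base file that this keyword-based extractor could parse might look like the following (the file content is illustrative only, and the exact keyword labels for the elided fields depend on how the remaining extraction logic is written):

Title: Inception
Director: Christopher Nolan
Actor: Leonardo DiCaprio
Genre: Science Fiction
Award: Oscar
Award Category: Best Visual Effects
Year: 2010

Inception follows a professional thief who infiltrates the subconscious of his targets to steal information.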
2. Multi-Hop RAG Engine: Leveraging `SelfQueryRetriever`
The core of the multi-hop RAG system is the SelfQueryRetriever. This LangChain component can interpret the user’s query to understand which metadata fields are relevant and how to filter the vector database to retrieve the necessary information.
import os
from dotenv import load_dotenv
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
# Load environment variables
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
llm = OpenAI(openai_api_key=openai_api_key)
DB_PATH = "chroma_db"
# Load the persisted Chroma vector database
def load_vectorstore(embeddings, db_path):
    vectordb = Chroma(persist_directory=db_path, embedding_function=embeddings)
    return vectordb
# Create the RAG chain with SelfQueryRetriever
def create_multihop_rag_chain_self_query(llm, vectordb):
    # Define the metadata fields and their descriptions
    metadata_field_info = [
        AttributeInfo(name="title", type="string", description="The title of a movie or book"),
        AttributeInfo(name="director", type="string", description="The director of a movie"),
        AttributeInfo(name="actor", type="string", description="An actor in a movie"),
        AttributeInfo(name="genre", type="string", description="The genre of a movie or book"),
        AttributeInfo(name="award", type="string", description="An award won by a movie or person"),
        AttributeInfo(name="award_category", type="string", description="The category of the award"),
        AttributeInfo(name="year", type="integer", description="The release year of a movie or publication year of a book"),
    ]
    # Describe the content of the documents
    document_content_description = "Information about a movie, actor, director, or award"
    # Initialize the SelfQueryRetriever
    retriever = SelfQueryRetriever.from_llm(
        llm=llm,
        vectorstore=vectordb,
        document_contents=document_content_description,
        metadata_field_info=metadata_field_info,
        search_kwargs={"k": 5},  # Number of documents to retrieve in each step
    )
    # Create the RetrievalQA chain
    rag_chain = RetrievalQA.from_llm(llm=llm, retriever=retriever, return_source_documents=True)
    return rag_chain
if __name__ == "__main__":
    vectordb = load_vectorstore(embeddings, DB_PATH)
    rag_chain = create_multihop_rag_chain_self_query(llm, vectordb)
    query = "Which movie directed by someone who won an Oscar for Best Director stars an actor who has also won an Oscar for Best Actor?"
    result = rag_chain({"query": query})
    print(f"Question: {query}")
    print(f"Answer: {result['result']}")
    print("\nSource Documents:")
    for doc in result['source_documents']:
        print(f"- {doc.page_content[:100]}...")
The metadata_field_info list defines the schema of our metadata, specifying the name, data type, and a human-readable description for each field. This information is used by the language model to understand how to query the vector database based on the user’s input. The document_content_description provides context about the information contained within the documents.
The SelfQueryRetriever takes the language model (LLM), the vector store, the document content description, and the metadata field information as input. When a query is passed to the RetrievalQA chain, the SelfQueryRetriever first generates a structured query based on the metadata fields and then uses this query to fetch relevant documents from the vector store. This allows for filtering based on specific attributes (e.g., finding documents where award is “Oscar” and award_category is “Best Director”).
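To make the “structured query” concrete, the object produced for the filter in that example would look roughly like the following, shown here hand-built with LangChain’s query-constructor primitives. Treat this as an illustration rather than guaranteed output: the exact query and filter depend on the LLM’s response at runtime.

from langchain.chains.query_constructor.ir import (
    Comparator, Comparison, Operation, Operator, StructuredQuery
)

# Approximately what the retriever's query constructor might emit for a question
# about movies whose director won the Oscar for Best Director: a free-text part
# plus a metadata filter over the fields declared in metadata_field_info.
example_query = StructuredQuery(
    query="movie",
    filter=Operation(
        operator=Operator.AND,
        arguments=[
            Comparison(comparator=Comparator.EQ, attribute="award", value="Oscar"),
            Comparison(comparator=Comparator.EQ, attribute="award_category", value="Best Director"),
        ],
    ),
    limit=None,
)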
3. Dependencies and Setup
To run this implementation, you need to install the following Python libraries:
- langchain
- openai
- chromadb
- tiktoken
- python-dotenv
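For example, they can typically be installed in a single step:

pip install langchain openai chromadb tiktoken python-dotenv

Note that the import paths used in this article follow older LangChain releases; on newer versions the same classes live in separate packages such as langchain-community and langchain-openai, so you may need to pin versions or adjust the imports accordingly.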
You also need to set up your OpenAI API key as an environment variable in a .env file:
OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
Finally, create a knowledge_base directory and populate it with your text files containing the information you want to query. The format of these files should ideally include the metadata keywords (e.g., “Title:”, “Director:”, “Actor:”, etc.) that the enrich_metadata function expects.
Conclusion and Further Considerations
This detailed explanation provides a solid foundation for implementing a backend-only advanced RAG system with Multi-Hop Retrieval. By leveraging the SelfQueryRetriever and well-structured metadata, you can enable your chatbot to answer complex, multi-faceted questions that go beyond simple keyword matching.
Further improvements and considerations include:
- More Sophisticated Metadata Extraction: The current implementation uses basic keyword matching. For more complex documents, consider using Named Entity Recognition (NER) models to automatically extract entities and relationships (a minimal sketch follows this list).
- Handling Ambiguous Queries: The SelfQueryRetriever relies on the LLM to understand the query. For ambiguous queries, you might need to implement additional logic for query clarification or multiple retrieval strategies.
- Optimizing Retrieval: Experiment with different values for search_kwargs (e.g., increasing k) to retrieve more documents in each step, potentially improving recall but also increasing computational cost.
- Evaluating Performance: Thoroughly evaluate the performance of your multi-hop RAG system with a diverse set of complex questions to identify areas for improvement.
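On the first point, here is a minimal sketch of swapping the keyword-based extractor for an off-the-shelf NER model, using spaCy as one possible choice. The model name and the mapping from entity labels to metadata fields are assumptions for illustration; real documents would need a more careful mapping (and possibly relation extraction) to distinguish roles such as director versus actor.

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def enrich_metadata_ner(chunk):
    """Illustrative NER-based alternative to the keyword-based enrich_metadata."""
    metadata = dict(chunk.metadata)
    doc = nlp(chunk.page_content)
    people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    works = [ent.text for ent in doc.ents if ent.label_ == "WORK_OF_ART"]
    dates = [ent.text for ent in doc.ents if ent.label_ == "DATE"]
    # Chroma metadata values must be scalars, so lists are joined into strings.
    # Deciding whether a PERSON is the director or an actor still requires
    # document structure or a relation-extraction step; here we only record
    # the raw entities as a starting point.
    if people:
        metadata["people"] = ", ".join(people)
    if works:
        metadata["title"] = works[0]
    if dates:
        metadata["date_mentions"] = ", ".join(dates)
    return metadata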