k-NN (k-Nearest Neighbors) search in OpenSearch

To perform a k-NN (k-Nearest Neighbors) search in OpenSearch after loading your manuals (or any documents) as vector embeddings, you’ll use the knn query within the OpenSearch _search API. Here’s how you can do it:

Understanding the knn Query

The knn query in OpenSearch finds the k vectors most similar to a query vector, according to the distance or similarity metric configured for the field (such as Euclidean distance or cosine similarity).

Steps to Perform a k-NN Search:

  1. Identify the Vector Field: You need to know the name of the field in your OpenSearch index that contains the vector embeddings of your manual chunks (e.g., "embedding" as used in the previous examples).
  2. Construct the Search Query: You’ll create a JSON request to the OpenSearch _search endpoint, using the knn query type.
  3. Specify the Query Vector: Within the knn query, you’ll provide the vector you want to find similar vectors to. This query vector should have the same dimensionality as the vectors in your index. You’ll likely generate this query vector by embedding the user’s search query using the same embedding model you used for your manuals.
  4. Define k: You need to specify the number of nearest neighbors (k) you want OpenSearch to return.
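Before any of these steps, the index itself must have k-NN enabled and contain a knn_vector field whose dimension matches your embedding model. A minimal sketch of creating such an index (the index name "manuals", the field name "embedding", and the dimension 768 for all-mpnet-base-v2 are assumptions chosen to match the examples below):

```python
# Sketch: create an index with a knn_vector field for approximate k-NN search.
# Assumes an initialized OpenSearch client (os_client); the index name,
# field name, and dimension (768 for all-mpnet-base-v2) are illustrative.
index_body = {
    "settings": {
        "index": {
            "knn": True  # enable approximate k-NN for this index
        }
    },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 768  # must match the embedding model's output size
            },
            "content": {"type": "text"}  # the manual chunk text itself
        }
    }
}
# os_client.indices.create(index="manuals", body=index_body)
```

If the dimension in the mapping does not match the length of the vectors you index (or of the query vector), OpenSearch will reject the documents or the search.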

Example using the OpenSearch Client:

Assuming you have the OpenSearch Python client initialized (os_client) as in the previous code snippets, here’s how you can perform a k-NN search:

Python

def perform_knn_search(index_name, query_vector, k=3):
    """
    Performs a k-NN search on the specified OpenSearch index.

    Args:
        index_name (str): The name of the OpenSearch index.
        query_vector (list): The vector to search for nearest neighbors of.
        k (int): The number of nearest neighbors to return.

    Returns:
        list: A list of the top k matching documents (hits).
    """
    search_query = {
        "size": k,  # Limit the number of results to k (can be different from k in knn)
        "query": {
            "knn": {
                "embedding": {  # Replace "embedding" with the actual name of your vector field
                    "vector": query_vector,
                    "k": k
                }
            }
        }
    }

    try:
        response = os_client.search(index=index_name, body=search_query)
        hits = response['hits']['hits']
        return hits
    except Exception as e:
        print(f"Error performing k-NN search: {e}")
        return []

# --- Example Usage ---
if __name__ == "__main__":
    # Assuming you have a user query
    user_query = "How do I troubleshoot a connection issue?"

    # Generate the embedding for the user query using the same model
    import torch
    from transformers import AutoTokenizer, AutoModel
    embedding_model_name = "sentence-transformers/all-mpnet-base-v2"
    embedding_tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
    embedding_model = AutoModel.from_pretrained(embedding_model_name)

    def get_query_embedding(text, tokenizer, model):
        inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():  # inference only; no gradients needed
            outputs = model(**inputs)
        # Mean-pool over token embeddings, using the attention mask so that
        # padding tokens do not dilute the average
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        summed = (outputs.last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1)
        return (summed / counts).squeeze(0).tolist()

    query_embedding = get_query_embedding(user_query, embedding_tokenizer, embedding_model)

    # Perform the k-NN search (OPENSEARCH_INDEX_NAME as defined in the previous snippets)
    search_results = perform_knn_search(OPENSEARCH_INDEX_NAME, query_embedding, k=3)

    if search_results:
        print(f"Top {len(search_results)} most relevant manual snippets for query: '{user_query}'")
        for hit in search_results:
            print(f"  Score: {hit['_score']}")
            print(f"  Content: {hit['_source']['content'][:200]}...") # Display first 200 characters
            print("-" * 20)
    else:
        print("No relevant manual snippets found.")

Explanation of the Code:

  1. perform_knn_search Function:
    • Takes the index_name, query_vector, and the desired number of neighbors k as input.
    • Constructs the OpenSearch search query with the knn clause.
    • The vector field within knn specifies the query vector.
    • The k field within knn specifies the number of nearest neighbors to retrieve.
    • The size parameter in the top-level query controls the total number of hits returned by the search (it’s good practice to set it to at least k).
    • Executes the search using os_client.search().
    • Returns the hits array from the response, which contains the matching documents.
  2. Example Usage (if __name__ == "__main__":)
    • Defines a sample user_query.
    • Loads the same Sentence Transformer model used for embedding the manuals to generate an embedding for the user_query.
    • Calls the perform_knn_search function with the index name, the generated query embedding, and the desired number of results (k=3).
    • Prints the retrieved search results, including their score and a snippet of the content.
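Each element of the returned hits list is a plain dictionary, so downstream code (such as your API layer) can pull the score and stored fields out directly. A small sketch of that structure and a helper for display (the _source field names are assumptions matching the earlier examples):

```python
# Sketch of a single hit as returned in response['hits']['hits'].
# Field names under "_source" (e.g. "content") are assumptions matching
# the earlier examples.
sample_hit = {
    "_index": "manuals",
    "_id": "chunk-42",
    "_score": 0.87,
    "_source": {"content": "To troubleshoot a connection issue, first check..."},
}

def format_hit(hit, preview_chars=200):
    """Return a (score, content preview) tuple for display."""
    return hit["_score"], hit["_source"]["content"][:preview_chars]

score, preview = format_hit(sample_hit)
```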

Key Considerations:

  • Embedding Model Consistency: Ensure that you use the same embedding model to generate the query embeddings as you used to embed your manuals. Inconsistent models will result in poor search results.
  • Vector Field Name: Replace "embedding" in the knn query with the actual name of your vector field in the OpenSearch index.
  • k Value: Experiment with different values of k to find the optimal number of relevant results for your application.
  • Distance Metric (Optional): OpenSearch uses the space_type defined in your index mapping (when you created the knn_vector field) as the distance metric for approximate k-NN. The knn query itself does not take a per-search space_type; if you need a different metric for a particular search, use an exact search via a script_score query with the knn_score script function instead.
  • Filtering (Optional): You can combine the knn query with other OpenSearch query clauses (like bool, filter, term, etc.) to further refine your search based on metadata (e.g., search within a specific manual or product).
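As an illustration of the filtering point above, the knn clause can carry a filter that restricts candidates before the nearest-neighbor step (efficient pre-filtering inside the knn clause requires the lucene or faiss engine; the "manual_id" metadata field here is hypothetical):

```python
def build_filtered_knn_query(query_vector, k=3, manual_id=None):
    """Build a k-NN query body, optionally filtered by a metadata field.

    The "manual_id" field is a hypothetical metadata field; efficient
    pre-filtering inside the knn clause requires the lucene or faiss engine.
    """
    knn_clause = {"vector": query_vector, "k": k}
    if manual_id is not None:
        # Restrict the candidate set to documents matching this term
        knn_clause["filter"] = {"term": {"manual_id": manual_id}}
    return {"size": k, "query": {"knn": {"embedding": knn_clause}}}

# Example: restrict the search to a single manual
body = build_filtered_knn_query([0.1] * 768, k=3, manual_id="router-x200")
```

The same body is passed to os_client.search() exactly as in perform_knn_search above; without the manual_id argument it degenerates to the plain k-NN query.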

This comprehensive example demonstrates how to perform a k-NN search in OpenSearch using the Python client, which is the core of how your API would retrieve relevant manual snippets based on a user’s question.