Loading documents into OpenSearch for vector search

Here’s how you can load documents into OpenSearch for vector search:

1. Create a k-NN Index

First, you need to create an index in OpenSearch that is configured for k-Nearest Neighbors (k-NN) search. This involves setting index.knn to true and defining the field that will store your vector embeddings as type knn_vector. You also need to specify the dimension of your vectors, which should match the output dimension of the embedding model you’re using.

JSON

PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      },
      "text_field": {
        "type": "text"
      }
    }
  }
}

In this example:

  • my-vector-index is the name of the index.
  • my_vector is the field that will store the vector embeddings.
  • dimension is set to 768, which is a common dimension for sentence transformer models. Adjust this value according to your model.
  • text_field is an example of another field you might want to index along with your vectors.
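
The index definition above can also be built in Python. This is a minimal sketch; the helper name is illustrative, and actually creating the index assumes a reachable cluster (the commented request shows one way to send the body):

```python
import json

def knn_index_body(vector_field: str, dimension: int, text_field: str = "text_field") -> dict:
    """Build the settings/mappings body for a k-NN index."""
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                vector_field: {"type": "knn_vector", "dimension": dimension},
                text_field: {"type": "text"},
            }
        },
    }

body = knn_index_body("my_vector", 768)  # dimension must match your embedding model
print(json.dumps(body, indent=2))

# Hypothetical send (requires a running cluster and valid credentials):
# import requests
# requests.put("https://localhost:9200/my-vector-index", json=body,
#              auth=("admin", "admin"), verify=False)
```

Building the body separately from the HTTP call lets you validate the dimension against your model before touching the cluster.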

2. Set up an Ingest Pipeline (Optional but Recommended)

If you want to generate embeddings directly within OpenSearch during ingestion, you’ll need to create an ingest pipeline. This pipeline will use a processor to transform your text data into vector embeddings.

  • Register and deploy a model: To generate embeddings within OpenSearch, first register and deploy a machine learning model through the ML Commons plugin.

    JSON

    POST /_plugins/_ml/models/_register?deploy=true
    {
      "name": "huggingface/sentence-transformers/all-distilroberta-v1",
      "version": "1.0.1",
      "model_format": "TORCH_SCRIPT"
    }
  • Create an ingest pipeline: Create a pipeline that uses the text_embedding processor to generate embeddings. You’ll need the model_id returned by the previous step.

    JSON

    PUT /_ingest/pipeline/my-embedding-pipeline
    {
      "processors": [
        {
          "text_embedding": {
            "model_id": "<model_id>",
            "field_map": {
              "text_field": "my_vector"
            }
          }
        }
      ]
    }

    In this example:
    • my-embedding-pipeline is the name of the ingest pipeline.
    • text_field is the field containing the text to be embedded.
    • my_vector is the field where the generated embedding will be stored.
  • Set the default pipeline: When creating your index, set the default_pipeline to the name of your ingest pipeline.

    JSON

    PUT /my-vector-index
    {
      "settings": {
        "index.knn": true,
        "default_pipeline": "my-embedding-pipeline"
      },
      "mappings": {
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 768
          },
          "text_field": {
            "type": "text"
          }
        }
      }
    }
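
The pipeline definition can likewise be assembled in Python. A sketch under the same assumptions as above (the helper name is illustrative, and the <model_id> placeholder must be replaced with the id returned when the model deployed):

```python
import json

def text_embedding_pipeline(model_id: str, source_field: str, vector_field: str) -> dict:
    """Build an ingest-pipeline body with a single text_embedding processor."""
    return {
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    # field_map sends source_field's text to the model and
                    # stores the resulting embedding in vector_field
                    "field_map": {source_field: vector_field},
                }
            }
        ]
    }

pipeline = text_embedding_pipeline("<model_id>", "text_field", "my_vector")
print(json.dumps(pipeline, indent=2))
```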

3. Ingest Data

Now you can ingest your documents into the index. If you’re using an ingest pipeline, the text will be automatically converted into embeddings. If not, you’ll need to generate the embeddings yourself and include them in the documents.

  • Bulk indexing: Use the Bulk API for efficient ingestion of multiple documents.

    JSON

    POST /my-vector-index/_bulk
    { "index": { "_index": "my-vector-index" } }
    { "text_field": "This is document 1", "my_vector": [0.1, 0.2, 0.3, ...] }
    { "index": { "_index": "my-vector-index" } }
    { "text_field": "This is document 2", "my_vector": [0.4, 0.5, 0.6, ...] }

    If you are using an ingest pipeline, you only need to provide the text:

    JSON

    POST /my-vector-index/_bulk
    { "index": { "_index": "my-vector-index" } }
    { "text_field": "This is document 1" }
    { "index": { "_index": "my-vector-index" } }
    { "text_field": "This is document 2" }
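
The bulk body is newline-delimited JSON (one action line, then one source line per document), which is easy to get subtly wrong by hand. A small Python sketch that builds it (the helper name is illustrative):

```python
import json

def bulk_payload(index: str, docs: list[dict]) -> str:
    """Serialize documents into the newline-delimited _bulk format."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline

payload = bulk_payload("my-vector-index", [
    {"text_field": "This is document 1"},
    {"text_field": "This is document 2"},
])
print(payload)
```

The payload would be POSTed to /_bulk with Content-Type: application/x-ndjson; note the required trailing newline, a common source of bulk-request errors.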

4. Search

Once your data is indexed, you can perform k-NN searches to find similar documents based on their vector embeddings.
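
As a sketch, a k-NN query body (sent to GET /my-vector-index/_search) can be built like this; the field name and k value are illustrative:

```python
def knn_query(vector_field: str, query_vector: list[float], k: int = 5) -> dict:
    """Return a search body that retrieves the k nearest neighbors of query_vector."""
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {"vector": query_vector, "k": k}
            }
        },
    }

# The query vector's length must equal the index's mapped dimension (768 here).
query = knn_query("my_vector", [0.1] * 768, k=3)
```

If you deployed a model and ingest pipeline as in step 2, you can alternatively use OpenSearch's neural query type, which embeds the query text for you instead of requiring a precomputed vector.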