Here’s how you can load documents into OpenSearch for vector search:
1. Create a k-NN Index
First, you need to create an index in OpenSearch that is configured for k-Nearest Neighbors (k-NN) search. This involves setting `index.knn` to `true` and defining the field that will store your vector embeddings as type `knn_vector`. You also need to specify the `dimension` of your vectors, which should match the output dimension of the embedding model you’re using.
```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      },
      "text_field": {
        "type": "text"
      }
    }
  }
}
```
In this example:
- `my-vector-index` is the name of the index.
- `my_vector` is the field that will store the vector embeddings.
- `dimension` is set to `768`, which is a common dimension for sentence transformer models. Adjust this value according to your model.
- `text_field` is an example of another field you might want to index along with your vectors.
2. Set up an Ingest Pipeline (Optional but Recommended)
If you want to generate embeddings directly within OpenSearch during ingestion, you’ll need to create an ingest pipeline. This pipeline will use a processor to transform your text data into vector embeddings.
- Register and deploy a model: To generate embeddings within OpenSearch, register and deploy a machine learning model (see the note after this list for retrieving the resulting `model_id`).

  ```json
  POST /_plugins/_ml/models/_register?deploy=true
  {
    "name": "huggingface/sentence-transformers/all-distilroberta-v1",
    "version": "1.0.1",
    "model_format": "TORCH_SCRIPT"
  }
  ```
- Create an ingest pipeline: Create a pipeline that uses the `text_embedding` processor to generate embeddings. You’ll need the `model_id` from the previous step.

  ```json
  PUT /_ingest/pipeline/my-embedding-pipeline
  {
    "processors": [
      {
        "text_embedding": {
          "model_id": "<model_id>",
          "field_map": {
            "text_field": "my_vector"
          }
        }
      }
    ]
  }
  ```
  In this example:
  - `my-embedding-pipeline` is the name of the ingest pipeline.
  - `text_field` is the field containing the text to be embedded.
  - `my_vector` is the field where the generated embedding will be stored.
- Set the default pipeline: When creating your index, set the `default_pipeline` to the name of your ingest pipeline.

  ```json
  PUT /my-vector-index
  {
    "settings": {
      "index.knn": true,
      "default_pipeline": "my-embedding-pipeline"
    },
    "mappings": {
      "properties": {
        "my_vector": {
          "type": "knn_vector",
          "dimension": 768
        },
        "text_field": {
          "type": "text"
        }
      }
    }
  }
  ```
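Note on retrieving the model ID: the `_register?deploy=true` call above runs asynchronously and returns a task ID rather than the model ID itself. As a minimal sketch (the actual task ID will come from your registration response), you can poll the ML Commons tasks API; once the task state is COMPLETED, the response includes the `model_id` to use in the pipeline:

```json
GET /_plugins/_ml/tasks/<task_id>
```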
3. Ingest Data
Now you can ingest your documents into the index. If you’re using an ingest pipeline, the text will be automatically converted into embeddings. If not, you’ll need to generate the embeddings yourself and include them in the documents.
- Bulk API: Use the Bulk API for efficient ingestion of multiple documents.

  ```json
  POST /my-vector-index/_bulk
  { "index": { "_index": "my-vector-index" } }
  { "text_field": "This is document 1", "my_vector": [0.1, 0.2, 0.3, ...] }
  { "index": { "_index": "my-vector-index" } }
  { "text_field": "This is document 2", "my_vector": [0.4, 0.5, 0.6, ...] }
  ```
  If you are using an ingest pipeline, you only need to provide the text:

  ```json
  POST /my-vector-index/_bulk
  { "index": { "_index": "my-vector-index" } }
  { "text_field": "This is document 1" }
  { "index": { "_index": "my-vector-index" } }
  { "text_field": "This is document 2" }
  ```
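If you only have a few documents, a plain index request works as well. This is a minimal sketch reusing the same index and field names from the examples above; with the default pipeline set, the embedding is generated automatically at ingest time:

```json
POST /my-vector-index/_doc
{
  "text_field": "This is document 3"
}
```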
4. Search
Once your data is indexed, you can perform k-NN searches to find similar documents based on their vector embeddings.
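As a minimal sketch, a raw k-NN query against the index above could look like the following (this assumes you embed the query text yourself with the same model, so the query vector has the same 768 dimensions; the values are abbreviated here):

```json
GET /my-vector-index/_search
{
  "size": 5,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [0.1, 0.2, 0.3, ...],
        "k": 5
      }
    }
  }
}
```

If you registered a model and use the neural-search plugin, you can instead let OpenSearch embed the query text for you with a `neural` query (again a sketch, reusing the `<model_id>` placeholder from step 2):

```json
GET /my-vector-index/_search
{
  "query": {
    "neural": {
      "my_vector": {
        "query_text": "example search text",
        "model_id": "<model_id>",
        "k": 5
      }
    }
  }
}
```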