Here’s how you can load documents into OpenSearch for vector search:
1. Create a k-NN Index
First, you need to create an index in OpenSearch that is configured for k-Nearest Neighbors (k-NN) search. This involves setting index.knn to true and defining the field that will store your vector embeddings as type knn_vector. You also need to specify the dimension of your vectors, which should match the output dimension of the embedding model you’re using.
```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      },
      "text_field": {
        "type": "text"
      }
    }
  }
}
```
In this example:
- `my-vector-index` is the name of the index.
- `my_vector` is the field that will store the vector embeddings.
- `dimension` is set to `768`, a common output dimension for sentence-transformer models. Adjust this value according to your model.
- `text_field` is an example of another field you might want to index along with your vectors.
2. Set up an Ingest Pipeline (Optional but Recommended)
If you want to generate embeddings directly within OpenSearch during ingestion, you’ll need to create an ingest pipeline. This pipeline will use a processor to transform your text data into vector embeddings.
- Register and deploy a model: If you want to generate embeddings within OpenSearch, you'll need to register and deploy a machine learning model.

```json
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/all-distilroberta-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
```

- Create an ingest pipeline: Create a pipeline that uses the `text_embedding` processor to generate embeddings. You'll need the `model_id` returned by the previous step.

```json
PUT /_ingest/pipeline/my-embedding-pipeline
{
  "processors": [
    {
      "text_embedding": {
        "model_id": "<model_id>",
        "field_map": {
          "text_field": "my_vector"
        }
      }
    }
  ]
}
```

In this example:
- `my-embedding-pipeline` is the name of the ingest pipeline.
- `text_field` is the field containing the text to be embedded.
- `my_vector` is the field where the generated embedding will be stored.
- Set the default pipeline: When creating your index, set `default_pipeline` to the name of your ingest pipeline.

```json
PUT /my-vector-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "my-embedding-pipeline"
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 768
      },
      "text_field": {
        "type": "text"
      }
    }
  }
}
```
3. Ingest Data
Now you can ingest your documents into the index. If you’re using an ingest pipeline, the text will be automatically converted into embeddings. If not, you’ll need to generate the embeddings yourself and include them in the documents.
- Bulk API: Use the Bulk API for efficient ingestion of multiple documents.

```json
POST /my-vector-index/_bulk
{ "index": { "_index": "my-vector-index" } }
{ "text_field": "This is document 1", "my_vector": [0.1, 0.2, 0.3, ...] }
{ "index": { "_index": "my-vector-index" } }
{ "text_field": "This is document 2", "my_vector": [0.4, 0.5, 0.6, ...] }
```

If you are using an ingest pipeline, you only need to provide the text:

```json
POST /my-vector-index/_bulk
{ "index": { "_index": "my-vector-index" } }
{ "text_field": "This is document 1" }
{ "index": { "_index": "my-vector-index" } }
{ "text_field": "This is document 2" }
```
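Before moving on to search, it can help to confirm the documents actually landed in the index. A quick sanity check, using the index name from the examples above:

```json
POST /my-vector-index/_refresh
GET /my-vector-index/_count
```

The `_refresh` call makes recently indexed documents visible to search, and `_count` should then report the number of documents you ingested.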
4. Search
Once your data is indexed, you can perform k-NN searches to find similar documents based on their vector embeddings.
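As a sketch, a basic k-NN query against the index above looks like this. The query vector must have the same dimension (768 here) as the indexed vectors; the values shown are placeholders:

```json
GET /my-vector-index/_search
{
  "size": 5,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [0.1, 0.2, 0.3, ...],
        "k": 5
      }
    }
  }
}
```

If you deployed a model and ingest pipeline in step 2, you can instead use the `neural` query type, which embeds your query text for you using the same `model_id`, so you don't have to generate the query vector yourself.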
