Vector DB Weaviate Advanced Internal Concepts and Code Snippets

Weaviate Internal Concepts and Code Snippets

This document explores the core internal concepts of Weaviate, an open-source vector database, and provides illustrative code snippets using the Python client library to demonstrate its usage.

Internal Concepts of Weaviate

Schema and Collections

  • Schema: Defines the structure of your data, including classes (now called Collections in newer versions), properties, and their data types. The schema also configures vectorization and indexing.
  • Collections (formerly Classes): Logical groupings of data objects, similar to tables in a relational database. Each object within a collection has a set of properties defined in the schema and an associated vector.

Vectorization Modules

  • Plug-in Architecture: Weaviate has a modular architecture for vectorization. You can choose from various modules (e.g., `text2vec-openai`, `text2vec-transformers`, `img2vec-neural`) to generate vectors from your data objects based on their properties.
  • Custom Vectorization: You can also provide your own vectors during data ingestion, bypassing Weaviate’s built-in vectorization modules.

Vector Indexing (HNSW)

  • Approximate Nearest Neighbors (ANN): Weaviate uses the Hierarchical Navigable Small World (HNSW) algorithm by default for efficient approximate nearest neighbor search. (HNSW Paper)
  • Index Configuration: You can tune HNSW parameters (e.g., `efConstruction`, `ef`, `maxConnections`) in the schema to balance build time, index size, and query performance. (Weaviate Vector Index Configuration)
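
To make the graph idea concrete, here is a minimal single-layer sketch of the greedy walk that HNSW-style search builds on. It is not Weaviate's implementation: real HNSW adds multiple layers and keeps a candidate list whose size is set by `ef`. The names `graph`, `vectors`, and `greedy_search` are hypothetical, and every node is assumed to have at least one neighbor.

```python
import math

def greedy_search(graph, vectors, query, entry):
    """Walk the graph from `entry`, always moving to the neighbor closest to the query."""
    current = entry
    best = math.dist(vectors[current], query)
    while True:
        # Pick the neighbor of the current node that is closest to the query
        candidate = min(graph[current], key=lambda n: math.dist(vectors[n], query))
        candidate_dist = math.dist(vectors[candidate], query)
        if candidate_dist >= best:
            return current  # no neighbor is closer, so the walk stops here
        current, best = candidate, candidate_dist

# Five points on a line, each linked to its immediate neighbors
vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [2.0, 0.0], 3: [3.0, 0.0], 4: [4.0, 0.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(greedy_search(graph, vectors, [3.2, 0.0], entry=0))  # 3
```

The walk only inspects the neighbors of the nodes it visits, which is why graph-based search can avoid comparing the query against every stored vector.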

Inverted Index

  • Filtering and Keyword Search: Weaviate uses an inverted index to efficiently filter data objects based on property values and to support BM25 keyword searches, including the keyword half of hybrid search. (Weaviate Indexing Concepts)
  • Tokenization and Indexing: Text properties are tokenized and indexed, allowing for fast lookups based on keywords.
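
As a rough illustration of tokenization and keyword lookup, the sketch below builds a toy inverted index in plain Python. The tokenizer (lowercased alphanumeric words) and the helper names are assumptions made for this example and do not reflect Weaviate's actual tokenization options or storage format.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of documents that contain every token in the query."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The Future of AI",
    2: "Quantum Computing Explained",
    3: "AI and Quantum Computing",
}
index = build_inverted_index(docs)

print(sorted(lookup(index, "quantum computing")))  # [2, 3]
print(sorted(lookup(index, "ai")))  # [1, 3]
```

A lookup touches only the posting sets of the query tokens, so its cost depends on how many documents contain those words rather than on the total collection size.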

Data Storage

  • Object Store: Weaviate stores the actual data objects (properties and their values) in a persistent storage layer. Each shard keeps its objects on disk in Weaviate's own LSM-tree-based key-value store.
  • Vector Storage: The vector embeddings are stored and indexed separately for efficient similarity calculations.

GraphQL API

  • Flexible Data Interaction: Weaviate exposes a powerful GraphQL API for querying data objects and exploring the graph of interconnected data. Adding, updating, and deleting objects goes through the REST (or gRPC) API instead. (Weaviate GraphQL API)
  • Vector Search Integration: The GraphQL API includes specialized operators for performing vector similarity searches (`nearVector`, `nearObject`).
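
For orientation, the sketch below assembles a GraphQL `Get` query that uses the `nearVector` operator. The helper name `near_vector_query` is hypothetical, and the example assumes an `Article` collection with `title` and `content` properties. The resulting string would typically be sent as the `query` field of a JSON body to the `/v1/graphql` endpoint.

```python
import json

def near_vector_query(collection, vector, properties, limit=3):
    """Build a Weaviate GraphQL Get query that uses the nearVector operator."""
    fields = " ".join(properties)
    return (
        "{ Get { %s(nearVector: {vector: %s}, limit: %d) "
        "{ %s _additional { distance } } } }"
        % (collection, json.dumps(vector), limit, fields)
    )

query = near_vector_query("Article", [0.1, 0.2, 0.3], ["title", "content"])
print(query)
```

The `_additional { distance }` selection asks Weaviate to return the distance between each result and the query vector alongside the requested properties.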

Code Snippets (Python Client)

1. Connecting to Weaviate and Checking Connection

import os
import weaviate
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

if client.is_ready():
    print("Weaviate client is ready.")
else:
    print("Weaviate client is not ready.")

client.close()  # Release the connection when done

Connecting to a Weaviate instance and checking its readiness.

2. Defining a Schema (Collection)

import os
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "Article"

client.collections.create(
    name=collection_name,
    # Choose your vectorizer module (newer client releases may prefer the vector_config argument)
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
    ],
)

print(f"Collection '{collection_name}' created.")

client.close()

Creating a new collection named “Article” with title and content properties, using OpenAI for vectorization.

3. Adding Data Objects

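import os
import weaviate
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
    # text2vec-openai needs an OpenAI key; OPENAI_API_KEY is an assumed variable name
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

collection_name = "Article"

articles = [
    {"title": "The Future of AI", "content": "Artificial intelligence is rapidly evolving..."},
    {"title": "Quantum Computing Explained", "content": "Quantum computing harnesses the principles..."},
]

collection = client.collections.get(collection_name)

# Batch import; the context manager sends any remaining objects on exit
with collection.batch.dynamic() as batch:
    for article in articles:
        batch.add_object(properties=article)

# Report objects that failed to import, if any
if collection.batch.failed_objects:
    print(f"{len(collection.batch.failed_objects)} objects failed to import.")
else:
    print("Articles added to the collection.")

client.close()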

Adding multiple data objects (articles) to the “Article” collection using batching for efficiency.

4. Performing Vector Search

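import os
import numpy as np
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import MetadataQuery

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "Article"
# Example vector; its length must match the collection's vectors
# (1536 is common for OpenAI embedding models, so adjust for your vectorizer)
query_vector = np.random.rand(1536).tolist()

collection = client.collections.get(collection_name)
response = collection.query.near_vector(
    near_vector=query_vector,
    limit=3,
    return_properties=["title", "content"],
    return_metadata=MetadataQuery(distance=True),  # Include the distance to the query vector
)

for obj in response.objects:
    print(obj.properties["title"], obj.metadata.distance)

client.close()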

Performing a vector similarity search using a query vector and retrieving the title, content, and distance of the top 3 nearest articles.

5. Performing Hybrid Search (Vector + Keyword)

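import os
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import MetadataQuery

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
    # text2vec-openai vectorizes the query text; OPENAI_API_KEY is an assumed variable name
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

collection_name = "Article"
query = "future of computing"

collection = client.collections.get(collection_name)
response = collection.query.hybrid(
    query=query,
    alpha=0.6,  # 1.0 = pure vector search, 0.0 = pure BM25; BM25 gets the remaining 0.4
    limit=3,
    return_properties=["title", "content"],
    return_metadata=MetadataQuery(score=True),  # Include the fused hybrid score
)

for obj in response.objects:
    print(obj.properties["title"], obj.metadata.score)

client.close()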

Performing a hybrid search combining vector similarity and keyword matching.

6. Filtering with the Inverted Index

import os
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import Filter

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "Article"

collection = client.collections.get(collection_name)
response = collection.query.fetch_objects(
    limit=3,
    # "Like" filter with wildcards: matches titles that contain "AI"
    filters=Filter.by_property("title").like("*AI*"),
    return_properties=["title", "content"],
)

for obj in response.objects:
    print(obj.properties)

client.close()

Filtering articles where the title contains “AI” using the inverted index.

7. Getting Collection Information

import os
import weaviate
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "Article"

collection = client.collections.get(collection_name)
description = collection.config.get()  # Properties, vectorizer, and index settings
print(description)

client.close()

Retrieving and printing the description of the “Article” collection (schema information).

These code snippets provide a basic introduction to interacting with Weaviate using the Python client. Weaviate offers many more advanced features and query options through its GraphQL API.

Advanced Weaviate Internal Concepts and Architecture

This document explores advanced internal concepts of Weaviate, an open-source vector database, complemented by illustrative code snippets using the Python client library and a high-level architectural overview.

Advanced Internal Concepts

Hybrid Search Strategies

Weaviate supports hybrid search, combining vector similarity with keyword-based (BM25) retrieval for richer results.

  • Weighted Fusion: A single `alpha` parameter sets the balance between the vector and BM25 scores (1 is pure vector search, 0 is pure keyword search), which influences the final ranking of results.
  • Optimized Execution: Weaviate executes both the vector and keyword searches and merges the two result sets according to that weighting.
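
The weighting idea can be sketched in a few lines of plain Python. This is only an illustration loosely modeled on score-based fusion (normalize each result set, then blend with a vector weight `alpha`); it is not Weaviate's actual fusion code, and the function names and sample scores are made up for the example.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_fuse(vector_scores, bm25_scores, alpha=0.6):
    """Blend normalized vector and BM25 scores; alpha is the vector weight."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(bm25_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # higher means more similar
bm25_scores = {"b": 4.0, "c": 2.0, "d": 1.0}    # higher means better keyword match

ranking = hybrid_fuse(vector_scores, bm25_scores, alpha=0.6)
print([doc_id for doc_id, _ in ranking])  # ['b', 'a', 'c', 'd']
```

Document `b` wins because it scores reasonably on both signals, while `a` is strong only on the vector side and `d` appears only in the keyword results.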

Code Snippet (Hybrid Search)

import os
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import MetadataQuery

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
    # Needed when the collection's vectorizer calls OpenAI; OPENAI_API_KEY is an assumed variable name
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

collection_name = "YourCollectionName"  # Replace with your collection name
query = "articles about artificial intelligence"

collection = client.collections.get(collection_name)
response = collection.query.hybrid(
    query=query,
    vector=None,  # Optionally provide a specific vector
    alpha=0.6,  # Vector weight (0.0 to 1.0); BM25 gets the remaining 0.4
    limit=5,
    return_properties=["title", "content"],
    return_metadata=MetadataQuery(score=True),  # Include the fused hybrid score
)

for obj in response.objects:
    print(obj.properties["title"], obj.metadata.score)

client.close()

Performing a hybrid search with `alpha` setting the balance between the vector and BM25 components.

Vector Indexing (HNSW and More)

Weaviate’s vector indexing is crucial for efficient similarity searches.

  • Hierarchical Navigable Small World (HNSW): This is the default and highly performant approximate nearest neighbor (ANN) algorithm used by Weaviate, balancing speed and accuracy for large datasets. (HNSW Paper)
  • Flat Index: For smaller datasets, Weaviate offers a flat (brute-force) index that provides exact nearest neighbors.
  • Dynamic Index: Weaviate can start with a flat index and automatically switch to HNSW once the number of objects passes a configurable threshold.
  • Index Configuration: You can fine-tune HNSW parameters (e.g., `efConstruction`, `ef`) during schema definition to optimize for build speed, query speed, and recall. (Weaviate Vector Index Configuration)
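
For comparison with HNSW, the flat (brute-force) approach can be sketched as follows: score every stored vector against the query and keep the closest `k`. This is an illustrative stand-in rather than Weaviate's code; the function names are hypothetical, and cosine distance is assumed as the metric.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def flat_search(vectors, query, k=2):
    """Exact k-nearest-neighbor search: score every vector, keep the k closest."""
    scored = [(cosine_distance(vec, query), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for _, doc_id in sorted(scored)[:k]]

vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
}

print(flat_search(vectors, [1.0, 0.1], k=2))  # ['a', 'c']
```

The result is exact, but the work grows linearly with the number of vectors, which is the cost an ANN index such as HNSW is meant to avoid on large datasets.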

Code Snippet (Schema Definition with HNSW Configuration)

import os
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "MyArticleCollection"

client.collections.create(
    name=collection_name,
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),  # Or any other vectorizer module
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="content", data_type=DataType.TEXT),
    ],
    vector_index_config=Configure.VectorIndex.hnsw(
        ef_construction=128,  # Candidate list size while building the graph
        ef=-1,  # Dynamic ef
        max_connections=16,  # Maximum edges per node (the "M" parameter in the HNSW paper)
    ),
)

print(f"Collection '{collection_name}' created with HNSW configuration.")

client.close()

Defining a schema with specific HNSW parameters for the vector index.

Inverted Index for Filtering and Keyword Search

Weaviate uses an inverted index to efficiently handle filters and power BM25 keyword searches.

  • Searchable and Filterable Indices: Properties can be configured with `indexSearchable` (for BM25) and `indexFilterable` flags in the schema. (Weaviate Indexing Concepts)
  • Range Filtering: Weaviate supports efficient filtering based on numerical and date ranges.
  • Bitmap Indices: For filterable properties, Weaviate often uses Roaring Bitmap indices for fast set operations during filtering.
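
The set-based view of filtering can be sketched with ordinary Python sets standing in for bitmaps. The sample `products` data and helper names are hypothetical, and the range check is a linear scan used only to keep the example short; a real range index would avoid scanning every object.

```python
products = {
    1: {"category": "books", "price": 12.0},
    2: {"category": "books", "price": 48.0},
    3: {"category": "games", "price": 30.0},
    4: {"category": "books", "price": 25.0},
}

# Filterable index: property value -> set of matching object ids
category_index = {}
for obj_id, props in products.items():
    category_index.setdefault(props["category"], set()).add(obj_id)

def price_between(low, high):
    """Ids whose price falls in [low, high] (linear scan standing in for a range index)."""
    return {obj_id for obj_id, props in products.items() if low <= props["price"] <= high}

# category == "books" AND 10 <= price <= 30 becomes a set intersection
allow_list = category_index["books"] & price_between(10, 30)
print(sorted(allow_list))  # [1, 4]
```

Combining conditions with AND or OR maps to set intersection or union, which is the kind of operation compressed bitmaps are designed to perform quickly.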

Code Snippet (Schema Definition with Inverted Index Configuration)

import os
import weaviate
from weaviate.classes.config import DataType, Property
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "ProductCollection"

client.collections.create(
    name=collection_name,
    properties=[
        # Text: searchable (BM25) and filterable
        Property(name="name", data_type=DataType.TEXT, index_searchable=True, index_filterable=True),
        # Number: filterable, with a range index for greater-than and less-than filters
        Property(name="price", data_type=DataType.NUMBER, index_filterable=True, index_range_filters=True),
        Property(name="category", data_type=DataType.TEXT, index_filterable=True),
    ],
)

print(f"Collection '{collection_name}' created with inverted index configuration.")

client.close()

Defining a schema with configurations for searchable, filterable, and range-filterable inverted indices.

Data Sharding and Replication

Weaviate is designed for horizontal scalability and high availability through sharding and replication.

  • Sharding: Data within a collection is divided into shards and distributed across multiple nodes in a cluster. The number of shards can be configured during collection creation. (Weaviate Cluster Architecture)
  • Replication: Each shard can have multiple replicas on different nodes, ensuring data redundancy and fault tolerance. The replication factor is also configurable. (Weaviate Replication Architecture)
  • Leaderless Architecture (for Data): Weaviate uses a leaderless replication strategy for data, inspired by Cassandra, to avoid single points of failure and improve write availability.
  • Raft Consensus (for Metadata): Cluster metadata (schema, shard states) is managed using the Raft consensus algorithm for consistency.
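
Two of these ideas, routing an object to a shard and requiring a majority of replicas to acknowledge a write, can be sketched in plain Python. The SHA-256 hash here is a stand-in chosen for portability and is not the hash Weaviate uses; `shard_for`, `quorum`, and `write_succeeds` are hypothetical helper names.

```python
import hashlib

def shard_for(object_id, shard_count):
    """Pick a shard by hashing the object id (stand-in hash, not Weaviate's own)."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

def write_succeeds(acks, replication_factor):
    """A quorum write succeeds once a majority of replicas have acknowledged it."""
    return acks >= quorum(replication_factor)

for object_id in ["article-1", "article-2", "article-3"]:
    print(object_id, "-> shard", shard_for(object_id, shard_count=3))

print(quorum(3))             # 2
print(write_succeeds(1, 3))  # False
print(write_succeeds(2, 3))  # True
```

Because the shard is derived from the object id alone, any node can compute where an object lives without consulting a central coordinator, which fits the leaderless design described above.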

Code Snippet (Collection Creation with Sharding and Replication)

import os
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "LargeScaleCollection"

client.collections.create(
    name=collection_name,
    sharding_config=Configure.sharding(desired_count=3),  # Split the collection into 3 shards
    replication_config=Configure.replication(factor=2),  # 2 copies per shard; needs at least 2 nodes
    properties=[
        Property(name="text", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
)

print(f"Collection '{collection_name}' created with 3 shards and a replication factor of 2.")

client.close()

Creating a collection with a specified number of shards and replicas for scalability and availability.

Vector Compression (Product and Scalar Quantization)

To reduce memory usage and improve query performance, Weaviate supports vector compression techniques.

  • Product Quantization (PQ): Divides vectors into segments and quantizes each segment independently, significantly reducing vector size with a potential trade-off in accuracy. (Weaviate Vector Quantization)
  • Scalar Quantization (SQ): Quantizes each dimension of the vector independently, typically mapping each 32-bit float to an 8-bit integer.
  • Configuration: Compression techniques can be configured at the collection level in the schema.
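
To give a feel for the trade-off, here is a minimal scalar quantization sketch followed by a size comparison. It assumes one global value range mapped onto 256 buckets, which is a simplification rather than Weaviate's exact scheme; the helper names and sample vectors are made up. The PQ figure assumes 96 segments with 256 centroids each, so every segment code fits in one byte.

```python
def sq_fit(vectors):
    """Find the global value range used to map floats onto 0..255."""
    values = [v for vec in vectors for v in vec]
    return min(values), max(values)

def sq_encode(vec, lo, hi):
    """Compress each float to one byte (256 evenly spaced buckets)."""
    span = (hi - lo) or 1.0
    return bytes(round(255 * (min(max(v, lo), hi) - lo) / span) for v in vec)

def sq_decode(codes, lo, hi):
    """Approximate the original floats from their one-byte codes."""
    return [lo + (hi - lo) * c / 255 for c in codes]

vectors = [[-1.0, 0.0, 0.5], [0.25, 1.0, -0.5]]
lo, hi = sq_fit(vectors)
codes = sq_encode(vectors[0], lo, hi)
restored = sq_decode(codes, lo, hi)
print(list(codes), restored)

# Per-vector storage for a 768-dimensional embedding
dims = 768
raw_bytes = dims * 4  # float32: 3072 bytes
sq_bytes = dims * 1   # one byte per dimension: 768 bytes
pq_bytes = 96 * 1     # one byte per segment: 96 bytes
print(raw_bytes, sq_bytes, pq_bytes)  # 3072 768 96
```

The restored values differ slightly from the originals, which is the accuracy cost paid for the 4x (SQ) or 32x (PQ, in this configuration) reduction in vector size.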

Code Snippet (Schema Definition with PQ Compression)

import os
import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.init import Auth

# Initialize Weaviate client with the v4 Python client (replace with your cluster URL)
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],  # Replace with your Weaviate URL
    auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),  # Replace with your API key
)

collection_name = "CompressedCollection"

client.collections.create(
    name=collection_name,
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
    ],
    vector_index_config=Configure.VectorIndex.hnsw(
        ef_construction=128,
        ef=-1,
        max_connections=16,
        quantizer=Configure.VectorIndex.Quantizer.pq(
            segments=96,  # Must evenly divide the vector dimension (e.g., 768 / 96 = 8)
            centroids=256,
            training_limit=100000,  # Number of objects used to train the PQ codebook
        ),
    ),
)

print(f"Collection '{collection_name}' created with Product Quantization.")

client.close()

Defining a schema with Product Quantization for vector compression in the HNSW index.

Weaviate Architecture (High-Level)

The following is a simplified, high-level view of Weaviate’s architecture:

        +---------------------+       +---------------------+
        |    Client (SDKs,    |------>|     Weaviate Node   |
        | REST, GraphQL)      |       +---------------------+
        +---------------------+               |
                                              | (Internal Communication)
        +---------------------------------------+
        |
        v
        +---------------------+   +---------------------+   +---------------------+
        |       Shard 1       |   |       Shard 2       |   |       Shard N       |
        | (Object Store,      |   | (Object Store,      |   | (Object Store,      |
        | Inverted Index,     |   | Inverted Index,     |   | Inverted Index,     |
        | Vector Index)       |   | Vector Index)       |   | Vector Index)       |
        +---------------------+   +---------------------+   +---------------------+
               ^                     ^                     ^
               |                     |                     |
        +-------------------------------------------------------+
        |                  Distributed Cluster                  |
        +-------------------------------------------------------+
        

Explanation:

  • Client: Users interact with Weaviate through various clients (Python, Go, JavaScript), REST API, or GraphQL API.
  • Weaviate Node: A single instance of the Weaviate server. A cluster consists of multiple nodes.
  • Shards: Collections are horizontally partitioned into shards, distributed across the nodes in the cluster for scalability. Each shard contains:
    • Object Store: Stores the raw data objects.
    • Inverted Index: Enables efficient filtering and keyword (BM25) search.
    • Vector Index (e.g., HNSW): Allows for fast approximate nearest neighbor searches.
  • Distributed Cluster: Multiple Weaviate nodes working together, managing shards and replicas for scalability and high availability. The cluster uses a combination of leaderless (for data) and Raft-based (for metadata) replication.
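
The query path implied by this layout (send the query to every shard, then merge the partial results) can be sketched as follows. This is a simplified illustration, not Weaviate's coordinator logic: vectors are single numbers, distance is an absolute difference, and the names `search_shard` and `scatter_gather` are hypothetical.

```python
import heapq

def search_shard(shard, query, k):
    """Return the shard's k best (distance, object_id) pairs, closest first."""
    return heapq.nsmallest(k, ((abs(vec - query), obj_id) for obj_id, vec in shard.items()))

def scatter_gather(shards, query, k):
    """Query every shard, then merge the partial results into a global top k."""
    partials = [search_shard(shard, query, k) for shard in shards]
    return heapq.nsmallest(k, (hit for partial in partials for hit in partial))

# Three shards, each holding a slice of the collection (one-dimensional "vectors")
shards = [
    {"a": 0.1, "b": 0.9},
    {"c": 0.4, "d": 0.45},
    {"e": 0.8},
]

top = scatter_gather(shards, query=0.5, k=2)
print([obj_id for _, obj_id in top])  # ['d', 'c']
```

Each shard only needs to return its own best `k` candidates, since no object outside a shard's local top `k` can appear in the global top `k`.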

This architectural overview is a simplification. Weaviate’s internal workings involve more intricate details for managing consistency, concurrency, and resource allocation.
