Embeddings & Vector Search
From Floating-Point Vectors to Approximate Nearest Neighbours
Embeddings are how AI systems represent meaning numerically. Vector search is how you find related meaning at scale. Together, they're the infrastructure layer underneath retrieval-augmented generation, semantic search, recommendation, and classification.
What an embedding is
An embedding is a dense, fixed-length vector of floating-point numbers that represents a piece of content — a word, sentence, paragraph, image, or code snippet. It's the output of a model that has been trained to encode semantic meaning into a numerical space.
The essential property: content that means similar things should have similar vectors. "Dog" and "puppy" should be closer together in the vector space than "dog" and "democracy". Distance in the vector space corresponds to semantic similarity in the real world — to the degree the embedding model has learned to capture it.
// [figure: simplified 2D projection of an embedding space; clusters = similar meaning, distance = semantic dissimilarity]
Real embedding spaces have hundreds or thousands of dimensions, not two. The 2D projection above loses most of the structure — but the clustering intuition holds. Similar content is geometrically close; dissimilar content is far apart.
Embedding models
Embedding models are a distinct category from generative models. A generative model produces text token by token. An embedding model takes an input (text, image, audio) and produces a single fixed-length vector. The two are often confused because they both involve large transformer architectures and significant compute.
Common embedding model families include the BERT family (encoder-only transformers, good for text), CLIP (text and images in the same space), and newer models like E5, BGE, and OpenAI's text-embedding series. The right model depends on what you're embedding and what task you're optimising for — retrieval, clustering, classification, and cross-modal search each have different requirements.
Dimensionality
The output size of an embedding model — how many numbers are in the vector — is called the embedding dimensionality. Common values range from 384 to 3072. Higher dimensionality can encode more nuance, but increases storage costs, memory requirements, and the compute cost of similarity search. Some models support Matryoshka representation learning, which allows you to truncate embeddings to a smaller size with graceful quality degradation.
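Truncation under Matryoshka representation learning amounts to keeping a vector's leading components and re-normalising. A minimal numpy sketch, assuming a model actually trained with MRL (for an ordinary model, truncation discards information arbitrarily):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalise to unit length.
    Only meaningful for Matryoshka-trained models, where the leading
    dimensions carry the most information."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a real 1024-dim embedding.
full = np.random.default_rng(0).standard_normal(1024)
full = full / np.linalg.norm(full)

short = truncate_embedding(full, 256)  # 4x less storage per vector
```

The re-normalisation step matters: cosine similarity assumes unit-length vectors, and truncation changes the norm.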
Similarity search
Given a query vector, you want to find the vectors in your database that are closest to it. "Closest" is typically measured by cosine similarity (angle between vectors, invariant to magnitude) or dot product (magnitude-aware, commonly used when vectors are normalised). Euclidean distance is used less often in embedding contexts.
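Cosine similarity is just the dot product of the two vectors divided by the product of their norms. A toy sketch with hypothetical 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration only.
dog    = np.array([0.8, 0.1, 0.6, 0.0])
puppy  = np.array([0.7, 0.2, 0.5, 0.1])
senate = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(dog, puppy))   # high: related meaning
print(cosine_similarity(dog, senate))  # low: unrelated meaning
```

Note that for unit-normalised vectors the denominator is 1, so cosine similarity and dot product coincide — which is why many systems normalise at ingestion and then use the cheaper dot product.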
Exact nearest-neighbour search — compare the query vector to every vector in the database — is accurate but doesn't scale. Searching 10 million vectors at 1536 dimensions is roughly 15 billion multiply-accumulate operations per query, and doing that on every query is infeasible at any meaningful throughput.
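For unit-normalised vectors, exact search is a single matrix–vector product followed by a sort. A sketch on a toy database (sizes scaled down from the figures above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "database": 10,000 unit-normalised 256-dim vectors. At
# 10 million x 1536 dims this same product is ~15 billion
# multiply-adds per query -- hence the need for ANN indexes.
db = rng.standard_normal((10_000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.standard_normal(256).astype(np.float32)
query /= np.linalg.norm(query)

scores = db @ query                      # cosine similarity (unit vectors)
top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 nearest vectors
print(top_k, scores[top_k])
```

This brute-force scan is also the ground truth against which ANN recall is measured.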
Approximate nearest neighbours (ANN)
Approximate nearest neighbour algorithms trade a small amount of recall for orders-of-magnitude faster search. They work by building an index structure during ingestion that allows the search phase to skip most of the database and focus on regions of the vector space likely to contain the nearest neighbours.
HNSW
Hierarchical Navigable Small World graphs are the dominant ANN algorithm in production use. They build a multi-layer graph structure where edges connect similar vectors. Search starts at the top layer (coarse resolution) and progressively refines through lower layers. HNSW has excellent query speed and recall, but requires significant memory — the graph structure is held in RAM for fast traversal.
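The core move in HNSW is greedy graph traversal: from the current node, hop to whichever neighbour is closest to the query, and stop at a local optimum. A heavily simplified single-layer sketch — real HNSW builds the graph incrementally across multiple layers and keeps a candidate beam rather than a single node:

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.standard_normal((500, 32)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Crude proximity graph: link each vector to its M nearest neighbours.
# (Real HNSW constructs this incrementally, layer by layer.)
M = 8
sims = vecs @ vecs.T
neighbours = np.argsort(sims, axis=1)[:, ::-1][:, 1 : M + 1]

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Walk the graph, always moving to the neighbour closest to the
    query; stop at a local optimum. One layer of the HNSW idea."""
    current = entry
    while True:
        candidates = neighbours[current]
        best = candidates[np.argmax(vecs[candidates] @ query)]
        if vecs[best] @ query <= vecs[current] @ query:
            return current  # no neighbour improves: local optimum
        current = best

q = rng.standard_normal(32).astype(np.float32)
q /= np.linalg.norm(q)
approx = greedy_search(q)               # often, not always, the true nearest
exact = int(np.argmax(vecs @ q))        # brute-force ground truth
```

The multi-layer structure exists precisely to fix this sketch's weakness: a single-layer greedy walk can get stuck far from the true nearest neighbour, while coarse upper layers provide long-range hops to a good starting region.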
IVF
An inverted file (IVF) index partitions the vector space into clusters, typically with k-means, and restricts search to the clusters nearest the query. It has lower memory overhead than HNSW, but query speed and recall depend heavily on how many clusters are probed — a tunable parameter (nprobe in many implementations) that controls the accuracy/speed tradeoff.
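A minimal IVF sketch in numpy. For brevity the "training" step just samples database vectors as centroids where a production system would run k-means; everything else — inverted lists, probing the nearest clusters — follows the real algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.standard_normal((5_000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# "Train": pick centroids (production systems run k-means here).
n_clusters = 50
centroids = db[rng.choice(len(db), n_clusters, replace=False)]

# "Index": assign every vector to its nearest centroid (inverted lists).
assignments = np.argmax(db @ centroids.T, axis=1)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query: np.ndarray, nprobe: int = 5, k: int = 3) -> np.ndarray:
    """Scan only the nprobe clusters nearest to the query."""
    nearest_clusters = np.argsort(centroids @ query)[::-1][:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in nearest_clusters])
    scores = db[candidates] @ query
    return candidates[np.argsort(scores)[::-1][:k]]

q = rng.standard_normal(64).astype(np.float32)
q /= np.linalg.norm(q)
print(ivf_search(q, nprobe=5))   # raise nprobe for recall, lower for speed
```

With nprobe equal to the cluster count, IVF degenerates to exact search — a useful sanity check when tuning.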
Quantization of embeddings
Just as model weights can be quantised, embedding vectors can be compressed. Scalar quantization (reducing float32 to int8) reduces memory by roughly 4×. Binary quantization (one bit per dimension) reduces it further, with more significant recall degradation. For large-scale retrieval where recall can tolerate some loss, quantised embeddings can reduce both storage and search cost substantially.
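Scalar quantisation in its simplest form: divide by a scale factor, round to int8, and multiply back at search time. A sketch using one global scale — real systems usually store per-vector or per-dimension scales for better accuracy:

```python
import numpy as np

rng = np.random.default_rng(7)
emb = rng.standard_normal((1_000, 128)).astype(np.float32)

# Scalar quantisation: map each float32 to one of 255 int8 buckets.
scale = np.abs(emb).max() / 127.0
quantised = np.round(emb / scale).astype(np.int8)   # 4x less memory

# Dequantise for search (or search directly in int8 with SIMD kernels).
restored = quantised.astype(np.float32) * scale
max_err = np.abs(emb - restored).max()
print(emb.nbytes, quantised.nbytes, max_err)
```

The rounding error per component is bounded by half the scale, which is why quantised search typically re-ranks a shortlist of candidates against the full-precision vectors.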
Vector databases
A vector database is an index store with a query interface designed for similarity search. It handles ingestion (storing vectors and associated metadata), indexing (building the ANN structure), and search (returning top-k results for a query vector). Most also support filtering — restricting search to a subset of the database based on metadata conditions before or after the similarity search.
| System | Storage model | Best for |
|---|---|---|
| pgvector | PostgreSQL extension | Smaller datasets; existing Postgres infrastructure; exact + approximate search |
| Pinecone | Managed cloud service | Simple ops; serverless scale; teams without infra resources |
| Weaviate | Self-hosted or cloud | Multi-modal; hybrid BM25+vector search; rich filtering |
| Qdrant | Self-hosted or cloud | High-performance HNSW; payload filtering; Rust-based efficiency |
| Chroma | Embedded or self-hosted | Local dev; prototyping; small-scale retrieval |
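The ingest/search/filter operations all of these systems expose can be sketched as a toy in-memory store. This is illustrative only — the class and its method names are invented for this example and match no particular product's API:

```python
import numpy as np

class MiniVectorStore:
    """Toy in-memory vector store: ingestion, exact top-k search,
    and pre-filtering on metadata. Illustrative only."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata = []

    def add(self, vec, meta):
        v = np.asarray(vec, dtype=np.float32)
        v /= np.linalg.norm(v)                      # normalise at ingestion
        self.vectors = np.vstack([self.vectors, v])
        self.metadata.append(meta)

    def search(self, query, k=3, where=None):
        q = np.asarray(query, dtype=np.float32)
        q /= np.linalg.norm(q)
        # Pre-filter: restrict search to rows whose metadata matches.
        idx = np.arange(len(self.metadata))
        if where:
            idx = np.array([i for i in idx
                            if all(self.metadata[i].get(f) == v
                                   for f, v in where.items())], dtype=int)
        scores = self.vectors[idx] @ q              # cosine via dot product
        order = np.argsort(scores)[::-1][:k]
        return [(int(idx[i]), float(scores[i])) for i in order]

store = MiniVectorStore(4)
store.add([1, 0, 0, 0], {"lang": "en"})
store.add([0, 1, 0, 0], {"lang": "de"})
store.add([0.9, 0.1, 0, 0], {"lang": "en"})
hits = store.search([1, 0, 0, 0], k=2, where={"lang": "en"})  # "en" rows only
```

Real systems differ mainly in where the filter runs: pre-filtering shrinks the candidate set before the ANN search, while post-filtering risks returning fewer than k results when matches are sparse.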
Where embeddings sit in an AI system
In a RAG system, embeddings appear in two places: the ingestion pipeline (documents are chunked, embedded, and stored) and the query pipeline (the user's query is embedded, and similar document chunks are retrieved and placed in context for the generative model). The generative model itself doesn't use embeddings directly — it receives the retrieved text as part of its context window.
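The two pipelines can be sketched end to end. The `embed` function below is a deterministic stand-in (a toy bag-of-characters vector) for a real embedding model call, and the documents are invented for illustration:

```python
import numpy as np

def embed(text):
    """Stand-in for a real embedding model (e.g. an API call).
    Here: a toy bag-of-characters vector, deterministic, unit-length."""
    v = np.zeros(26, dtype=np.float32)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# --- Ingestion pipeline: chunk, embed, store ---
documents = ["dogs are loyal pets", "parliament passed the bill",
             "puppies need daily training"]
index = np.stack([embed(d) for d in documents])

# --- Query pipeline: embed the query, retrieve, build the prompt ---
query = "how do I train my puppy"
scores = index @ embed(query)
best = documents[int(np.argmax(scores))]

prompt = f"Context:\n{best}\n\nQuestion: {query}"
# `prompt` goes to the generative model; it never sees the vectors.
```

Note the asymmetry: vectors exist only on the retrieval side, and the generative model receives plain text.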
// Embedding model and generative model don't need to match
You can use any embedding model with any generative model. The embedding model is purely for retrieval — finding the right text to include in context. The generative model never sees the vectors; it only sees the text retrieved using those vectors. This means you can optimise embedding and generation independently.
// In short