Embeddings & Vector Search
From Floating-Point Vectors to Approximate Nearest Neighbours
Embeddings are how AI systems represent meaning numerically. Vector search is how you find related meaning at scale. Together, they're the infrastructure layer underneath retrieval-augmented generation, semantic search, recommendation, and classification.
What an embedding is
An embedding is a dense, fixed-length vector of floating-point numbers that represents a piece of content — a word, sentence, paragraph, image, or code snippet. It's the output of a model that has been trained to encode semantic meaning into a numerical space.
The essential property: content that means similar things should have similar vectors. "Dog" and "puppy" should be closer together in the vector space than "dog" and "democracy". Distance in the vector space corresponds to semantic similarity in the real world — to the degree the embedding model has learned to capture it.
// [figure: simplified 2D projection of an embedding space; clusters = similar meaning, distance = semantic dissimilarity]
Real embedding spaces have hundreds or thousands of dimensions, not two. The 2D projection above loses most of the structure — but the clustering intuition holds. Similar content is geometrically close; dissimilar content is far apart.
Embedding models
Embedding models are a distinct category from generative models. A generative model produces text token by token. An embedding model takes an input (text, image, audio) and produces a single fixed-length vector. The two are often confused because they both involve large transformer architectures and significant compute.
Common embedding model families include the BERT family (encoder-only transformers, good for text), CLIP (text and images in the same space), and newer models like E5, BGE, and OpenAI's text-embedding series. The right model depends on what you're embedding and what task you're optimising for — retrieval, clustering, classification, and cross-modal search each have different requirements.
Dimensionality
The output size of an embedding model — how many numbers are in the vector — is called the embedding dimensionality. Common values range from 384 to 3072. Higher dimensionality can encode more nuance, but increases storage costs, memory requirements, and the compute cost of similarity search. Some models support Matryoshka representation learning, which allows you to truncate embeddings to a smaller size with graceful quality degradation.
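Truncation under Matryoshka representation learning amounts to keeping a vector's leading components and re-normalising. A minimal numpy sketch, assuming a model actually trained with MRL (for an ordinary model, truncation discards information arbitrarily):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalise to unit length.
    Only meaningful for Matryoshka-trained models, where the leading
    dimensions carry the most information."""
    truncated = vec[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a real 1024-dim embedding.
full = np.random.default_rng(0).standard_normal(1024)
full = full / np.linalg.norm(full)

short = truncate_embedding(full, 256)  # 4x less storage per vector
```

The re-normalisation step matters: cosine similarity assumes unit-length vectors, and truncation changes the norm.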
Similarity search
Given a query vector, you want to find the vectors in your database that are closest to it. "Closest" is typically measured by cosine similarity (angle between vectors, invariant to magnitude) or dot product (magnitude-aware, commonly used when vectors are normalised). Euclidean distance is used less often in embedding contexts.
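Cosine similarity is just the dot product of the two vectors divided by the product of their norms. A toy sketch with hypothetical 4-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration only.
dog    = np.array([0.8, 0.1, 0.6, 0.0])
puppy  = np.array([0.7, 0.2, 0.5, 0.1])
senate = np.array([0.0, 0.9, 0.1, 0.8])

print(cosine_similarity(dog, puppy))   # high: related meaning
print(cosine_similarity(dog, senate))  # low: unrelated meaning
```

Note that for unit-normalised vectors the denominator is 1, so cosine similarity and dot product coincide — which is why many systems normalise at ingestion and then use the cheaper dot product.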
Exact nearest-neighbour search — compare the query vector to every vector in the database — is accurate but doesn't scale. Searching 10 million vectors at 1536 dimensions is roughly 15 billion multiply-accumulate operations per query, and doing that on every query is infeasible at any meaningful throughput.
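For unit-normalised vectors, exact search is a single matrix–vector product followed by a sort. A sketch on a toy database (sizes scaled down from the figures above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "database": 10,000 unit-normalised 256-dim vectors. At
# 10 million x 1536 dims this same product is ~15 billion
# multiply-adds per query -- hence the need for ANN indexes.
db = rng.standard_normal((10_000, 256)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

query = rng.standard_normal(256).astype(np.float32)
query /= np.linalg.norm(query)

scores = db @ query                      # cosine similarity (unit vectors)
top_k = np.argsort(scores)[::-1][:5]     # indices of the 5 nearest vectors
print(top_k, scores[top_k])
```

This brute-force scan is also the ground truth against which ANN recall is measured.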
Approximate nearest neighbours (ANN)
Approximate nearest neighbour algorithms trade a small amount of recall for orders-of-magnitude faster search. They work by building an index structure during ingestion that allows the search phase to skip most of the database and focus on regions of the vector space likely to contain the nearest neighbours.
HNSW
Hierarchical Navigable Small World graphs are the dominant ANN algorithm in production use. They build a multi-layer graph structure where edges connect similar vectors. Search starts at the top layer (coarse resolution) and progressively refines through lower layers. HNSW has excellent query speed and recall, but requires significant memory — the graph structure is held in RAM for fast traversal.
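The core move in HNSW is greedy graph traversal: from the current node, hop to whichever neighbour is closest to the query, and stop at a local optimum. A heavily simplified single-layer sketch — real HNSW builds the graph incrementally across multiple layers and keeps a candidate beam rather than a single node:

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.standard_normal((500, 32)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Crude proximity graph: link each vector to its M nearest neighbours.
# (Real HNSW constructs this incrementally, layer by layer.)
M = 8
sims = vecs @ vecs.T
neighbours = np.argsort(sims, axis=1)[:, ::-1][:, 1 : M + 1]

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Walk the graph, always moving to the neighbour closest to the
    query; stop at a local optimum. One layer of the HNSW idea."""
    current = entry
    while True:
        candidates = neighbours[current]
        best = candidates[np.argmax(vecs[candidates] @ query)]
        if vecs[best] @ query <= vecs[current] @ query:
            return current  # no neighbour improves: local optimum
        current = best

q = rng.standard_normal(32).astype(np.float32)
q /= np.linalg.norm(q)
approx = greedy_search(q)               # often, not always, the true nearest
exact = int(np.argmax(vecs @ q))        # brute-force ground truth
```

The multi-layer structure exists precisely to fix this sketch's weakness: a single-layer greedy walk can get stuck far from the true nearest neighbour, while coarse upper layers provide long-range hops to a good starting region.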
IVF
An inverted file (IVF) index partitions the vector space into clusters, typically with k-means, and restricts search to the clusters nearest the query. It has lower memory overhead than HNSW, but query speed and recall depend heavily on how many clusters are probed — a tunable parameter (nprobe in many implementations) that controls the accuracy/speed tradeoff.
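A minimal IVF sketch in numpy. For brevity the "training" step just samples database vectors as centroids where a production system would run k-means; everything else — inverted lists, probing the nearest clusters — follows the real algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.standard_normal((5_000, 64)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# "Train": pick centroids (production systems run k-means here).
n_clusters = 50
centroids = db[rng.choice(len(db), n_clusters, replace=False)]

# "Index": assign every vector to its nearest centroid (inverted lists).
assignments = np.argmax(db @ centroids.T, axis=1)
inverted_lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query: np.ndarray, nprobe: int = 5, k: int = 3) -> np.ndarray:
    """Scan only the nprobe clusters nearest to the query."""
    nearest_clusters = np.argsort(centroids @ query)[::-1][:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in nearest_clusters])
    scores = db[candidates] @ query
    return candidates[np.argsort(scores)[::-1][:k]]

q = rng.standard_normal(64).astype(np.float32)
q /= np.linalg.norm(q)
print(ivf_search(q, nprobe=5))   # raise nprobe for recall, lower for speed
```

With nprobe equal to the cluster count, IVF degenerates to exact search — a useful sanity check when tuning.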
Quantization of embeddings
Just as model weights can be quantised, embedding vectors can be compressed. Scalar quantization (reducing float32 to int8) reduces memory by roughly 4×. Binary quantization (one bit per dimension) reduces it further, with more significant recall degradation. For large-scale retrieval where recall can tolerate some loss, quantised embeddings can reduce both storage and search cost substantially.
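Scalar quantisation in its simplest form: divide by a scale factor, round to int8, and multiply back at search time. A sketch using one global scale — real systems usually store per-vector or per-dimension scales for better accuracy:

```python
import numpy as np

rng = np.random.default_rng(7)
emb = rng.standard_normal((1_000, 128)).astype(np.float32)

# Scalar quantisation: map each float32 to one of 255 int8 buckets.
scale = np.abs(emb).max() / 127.0
quantised = np.round(emb / scale).astype(np.int8)   # 4x less memory

# Dequantise for search (or search directly in int8 with SIMD kernels).
restored = quantised.astype(np.float32) * scale
max_err = np.abs(emb - restored).max()
print(emb.nbytes, quantised.nbytes, max_err)
```

The rounding error per component is bounded by half the scale, which is why quantised search typically re-ranks a shortlist of candidates against the full-precision vectors.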
Vector databases
A vector database is an index store with a query interface designed for similarity search. It handles ingestion (storing vectors and associated metadata), indexing (building the ANN structure), and search (returning top-k results for a query vector). Most also support filtering — restricting search to a subset of the database based on metadata conditions before or after the similarity search.
| System | Storage model | Best for |
|---|---|---|
| pgvector | PostgreSQL extension | Smaller datasets; existing Postgres infrastructure; exact + approximate search |
| Pinecone | Managed cloud service | Simple ops; serverless scale; teams without infra resources |
| Weaviate | Self-hosted or cloud | Multi-modal; hybrid BM25+vector search; rich filtering |
| Qdrant | Self-hosted or cloud | High-performance HNSW; payload filtering; Rust-based efficiency |
| Chroma | Embedded or self-hosted | Local dev; prototyping; small-scale retrieval |
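The ingest/search/filter operations all of these systems expose can be sketched as a toy in-memory store. This is illustrative only — the class and its method names are invented for this example and match no particular product's API:

```python
import numpy as np

class MiniVectorStore:
    """Toy in-memory vector store: ingestion, exact top-k search,
    and pre-filtering on metadata. Illustrative only."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata = []

    def add(self, vec, meta):
        v = np.asarray(vec, dtype=np.float32)
        v /= np.linalg.norm(v)                      # normalise at ingestion
        self.vectors = np.vstack([self.vectors, v])
        self.metadata.append(meta)

    def search(self, query, k=3, where=None):
        q = np.asarray(query, dtype=np.float32)
        q /= np.linalg.norm(q)
        # Pre-filter: restrict search to rows whose metadata matches.
        idx = np.arange(len(self.metadata))
        if where:
            idx = np.array([i for i in idx
                            if all(self.metadata[i].get(f) == v
                                   for f, v in where.items())], dtype=int)
        scores = self.vectors[idx] @ q              # cosine via dot product
        order = np.argsort(scores)[::-1][:k]
        return [(int(idx[i]), float(scores[i])) for i in order]

store = MiniVectorStore(4)
store.add([1, 0, 0, 0], {"lang": "en"})
store.add([0, 1, 0, 0], {"lang": "de"})
store.add([0.9, 0.1, 0, 0], {"lang": "en"})
hits = store.search([1, 0, 0, 0], k=2, where={"lang": "en"})  # "en" rows only
```

Real systems differ mainly in where the filter runs: pre-filtering shrinks the candidate set before the ANN search, while post-filtering risks returning fewer than k results when matches are sparse.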
Where embeddings sit in an AI system
In a RAG system, embeddings appear in two places: the ingestion pipeline (documents are chunked, embedded, and stored) and the query pipeline (the user's query is embedded, and similar document chunks are retrieved and placed in context for the generative model). The generative model itself doesn't use embeddings directly — it receives the retrieved text as part of its context window.
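The two pipelines can be sketched end to end. The `embed` function below is a deterministic stand-in (a toy bag-of-characters vector) for a real embedding model call, and the documents are invented for illustration:

```python
import numpy as np

def embed(text):
    """Stand-in for a real embedding model (e.g. an API call).
    Here: a toy bag-of-characters vector, deterministic, unit-length."""
    v = np.zeros(26, dtype=np.float32)
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

# --- Ingestion pipeline: chunk, embed, store ---
documents = ["dogs are loyal pets", "parliament passed the bill",
             "puppies need daily training"]
index = np.stack([embed(d) for d in documents])

# --- Query pipeline: embed the query, retrieve, build the prompt ---
query = "how do I train my puppy"
scores = index @ embed(query)
best = documents[int(np.argmax(scores))]

prompt = f"Context:\n{best}\n\nQuestion: {query}"
# `prompt` goes to the generative model; it never sees the vectors.
```

Note the asymmetry: vectors exist only on the retrieval side, and the generative model receives plain text.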
// Embedding model and generative model don't need to match
You can use any embedding model with any generative model. The embedding model is purely for retrieval — finding the right text to include in context. The generative model never sees the vectors; it only sees the text retrieved using those vectors. This means you can optimise embedding and generation independently.
// In short