Most production AI systems in 2026 — anything with retrieval, semantic search, recommendations, or memory — are built on top of two things: embeddings and vector databases. Understanding them is non-negotiable for anyone building AI applications. Yet the foundational concepts get glossed over in most tutorials.
This is the working practitioner's view: what these things actually are, how to choose between options, and the implementation choices that actually matter.
1. What an embedding actually is
An embedding is a list of numbers — typically 384, 768, or 1,536 of them — that represents a piece of text in a way that captures its meaning. Texts with similar meanings produce embeddings that are close together in the high-dimensional space those numbers describe.
Two crucial properties:
- Semantic similarity becomes geometric distance. "A cat sat on the mat" and "The feline was on the rug" produce embeddings that are close together — even though they share almost no words.
- The mapping is consistent within a model. Two embeddings produced by the same model can be compared. Embeddings from different models cannot be compared meaningfully.
Embeddings are produced by an embedding model. The popular options in 2026 include OpenAI's text-embedding-3 family, Cohere's embed-v4, Voyage AI's models (the provider Anthropic recommends; Anthropic does not ship its own embedding model), and a long tail of open-source models (Nomic, BGE, E5, multilingual variants). For most production systems, pick one and stick with it — switching embedding models requires re-embedding all your data.
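To make the geometry concrete, here is a minimal sketch using the open-source sentence-transformers library with a BGE model (one of the open-source families above); the model name and texts are illustrative, and any model the library supports works the same way.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# BGE is one of the open-source families mentioned above.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dimensional embeddings

texts = [
    "A cat sat on the mat",
    "The feline was on the rug",
    "Quarterly revenue grew 4%",
]
# normalize_embeddings=True returns unit vectors, so a dot product of two
# embeddings is exactly their cosine similarity.
embeddings = model.encode(texts, normalize_embeddings=True)

similarities = embeddings @ embeddings.T
print(similarities[0, 1])  # cat vs feline: high, despite sharing almost no words
print(similarities[0, 2])  # cat vs revenue: low, unrelated meaning
```

Note that all three embeddings come from the same model, per the consistency property above; comparing them against vectors from a different model would be meaningless.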
2. How vector databases work
The embedding & retrieval lifecycle
A vector database is a specialised store for embeddings. Its core operation is nearest-neighbour search: given a query embedding, return the K stored embeddings closest to it. This is the operation behind every semantic search, every "find similar documents," and every RAG retrieval step.
The naive implementation — compute the distance to every stored embedding — is O(n) per query, and becomes the bottleneck somewhere in the hundreds of thousands of vectors, depending on your latency budget. Vector databases use approximate nearest-neighbour (ANN) algorithms — HNSW, IVF, ScaNN — that trade a small amount of recall for orders-of-magnitude faster query time. In practice the approximation is invisible: you query and get results in milliseconds.
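A sketch of both approaches, with random unit vectors standing in for real embeddings: exact brute-force search in NumPy next to the same query against an HNSW index built with the hnswlib library.

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, n = 768, 100_000
rng = np.random.default_rng(0)
vectors = rng.standard_normal((n, dim)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-length vectors
query = vectors[42]  # stand-in for a query embedding

# Naive exact search: one dot product per stored vector, O(n) per query.
scores = vectors @ query
exact_top5 = np.argsort(-scores)[:5]

# Approximate search: build an HNSW graph once, then query it.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(50)  # higher ef: better recall, slower queries
labels, distances = index.knn_query(query, k=5)

print(exact_top5)  # ground truth
print(labels[0])   # usually identical; the occasional miss is the recall trade-off
```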
3. Choosing a vector database in 2026
[Figure: vector database decision tree (2026)]
Three categories, depending on your situation.
Use pgvector if you can
The Postgres extension. If your team already runs Postgres — and most do — adding pgvector keeps your vector data alongside your transactional data, with one set of credentials, one backup strategy, and one operational story. Performance is good enough for the vast majority of real-world workloads (under 10M vectors). The fewer specialised systems in your stack, the better.
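A minimal pgvector sketch using the psycopg driver, assuming a Postgres instance with the extension available; the connection string, table, and column names are illustrative.

```python
# pip install psycopg pgvector
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)  # placeholder connection string
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # lets psycopg pass numpy arrays as pgvector values

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        body text,
        embedding vector(768)
    )
""")
# ANN index on cosine distance (HNSW support landed in pgvector 0.5).
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops)"
)

query_embedding = np.random.rand(768).astype(np.float32)  # stand-in for a real one
# <=> is pgvector's cosine-distance operator; smaller means closer.
rows = conn.execute(
    "SELECT id, body FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),
).fetchall()
```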
Use a managed vector DB for scale or simplicity
Pinecone, Weaviate Cloud, Qdrant Cloud, Turbopuffer. Reasons to pick these: tens of millions of vectors and counting; you do not have a Postgres team; you want hosted simplicity over operational control; you need specific features (multi-tenancy, hybrid search, namespaces) that pgvector does not handle naturally.
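The managed clients all have roughly the same shape: create a collection, upsert vectors with payloads, search. A sketch against Qdrant's Python client, with placeholder credentials and zero vectors standing in for real embeddings:

```python
# pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="...")  # placeholders

client.create_collection(
    collection_name="chunks",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
client.upsert(
    collection_name="chunks",
    points=[PointStruct(id=1, vector=[0.0] * 768, payload={"tenant_id": "acme"})],
)
hits = client.search(collection_name="chunks", query_vector=[0.0] * 768, limit=5)
```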
Use a self-hosted specialised vector DB rarely
Self-hosting Weaviate, Qdrant, or Milvus. Reasons: extreme scale (hundreds of millions to billions of vectors); strict data residency requirements that forbid managed services; or specialised features required for research workloads. For most enterprise applications, this is the wrong default.
4. The implementation decisions that actually matter
Embedding model choice
Larger embeddings (1536+ dimensions) are not always better — they cost more to store and query. For most applications, a strong 768- or 1024-dimensional embedding model is the sweet spot. Test on your actual data with a holdout evaluation set; benchmark numbers from generic datasets often do not predict performance on your specific corpus.
Chunking strategy
You do not embed entire documents — you embed chunks. The decision of how to chunk is the single biggest determinant of retrieval quality. A 2000-character chunk with 200-character overlap is a reasonable default. For technical content, semantic chunking (splitting at topic boundaries detected by embedding similarity) typically outperforms fixed-size splitting. We dive into this further in our production RAG article.
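The fixed-size default is only a few lines. A sketch of the 2000/200 split; real pipelines usually also snap chunk boundaries to sentences or paragraphs rather than cutting mid-word:

```python
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Fixed-size chunks with overlap, so a sentence cut at one chunk's
    boundary still appears intact at the start of the next chunk."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```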
Hybrid search
Pure vector search (semantic similarity) misses queries that hinge on specific terminology or proper nouns. Hybrid search combines vector search with traditional keyword search (BM25 or TF-IDF) and merges results using reciprocal rank fusion. The recall improvement is consistently meaningful and worth the implementation cost.
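Reciprocal rank fusion itself is tiny. A sketch that merges ranked lists of document IDs, using the conventional constant k=60:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs. Each list contributes
    1 / (k + rank) per document, so documents ranked highly by several
    retrievers float to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse vector-search hits with BM25 hits:
merged = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])
```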
Metadata filtering
Real applications need to filter results — by user, tenant, date range, document category — alongside semantic search. Vector databases handle this very differently from one another: pgvector lets you use any SQL WHERE clause, while managed vector DBs typically have their own metadata schemas. Decide your filtering needs upfront; retrofitting is painful.
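In pgvector, for example, the filter is an ordinary WHERE clause in the same query as the vector ordering (continuing the earlier sketch; the tenant_id and created_at columns are illustrative):

```python
rows = conn.execute(
    """
    SELECT id, body
    FROM chunks
    WHERE tenant_id = %s
      AND created_at >= now() - interval '90 days'
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    ("acme", query_embedding),
).fetchall()
```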
5. Operational gotchas
- Re-embedding is expensive. If you switch embedding models, you must re-embed everything. Build for this from day one — keep ingestion idempotent and versioned (see the sketch after this list).
- Stale data is invisible. Unlike a database query that fails when data is missing, a vector search returns whatever is closest — even if your index is months out of date. Build refresh pipelines and monitor recency.
- Cost scales with vectors stored, not just queries. Plan capacity. A million vectors at 1536 dimensions in float32 is about 6 GB (1,000,000 × 1536 × 4 bytes ≈ 6.1 GB) before any indexing overhead.
- Quality is not a single number. Recall, precision, latency, and cost trade off against each other. Pick the optimisation target your application actually cares about, then measure it.
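On the first bullet, one way to keep ingestion idempotent and versioned is to key every stored vector on a hash of the chunk content plus an embedding model identifier: re-running ingestion on unchanged chunks is then a no-op, and switching models produces new keys rather than silently overwriting old vectors. A sketch (names illustrative):

```python
import hashlib

EMBEDDING_MODEL = "bge-base-en-v1.5"  # bump this identifier when switching models

def vector_key(chunk: str, model: str = EMBEDDING_MODEL) -> str:
    """Deterministic ID: the same chunk under the same model always maps
    to the same key, so re-ingesting unchanged data is a no-op, and a
    model switch yields new keys instead of a silent overwrite."""
    digest = hashlib.sha256(f"{model}\x00{chunk}".encode()).hexdigest()
    return f"{model}:{digest[:16]}"
```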
6. The minimum viable evaluation
Before you ship any retrieval-based system, build a small evaluation set: 50–100 queries, each paired with the document(s) a correct retrieval should return. Run your retrieval pipeline on it. Measure recall@5 and recall@10. This is the difference between "the demo works" and "the system works." The discipline costs a day and saves quarters of debugging.
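Recall@k here just means "at least one correct document appeared in the top k". A sketch, assuming a retrieve(query, k) function that returns document IDs and an evaluation set of (query, relevant IDs) pairs:

```python
def recall_at_k(eval_set, retrieve, k: int) -> float:
    """Fraction of queries whose top-k retrieved IDs include at least
    one relevant document."""
    hits = sum(
        1 for query, relevant in eval_set
        if relevant & set(retrieve(query, k))
    )
    return hits / len(eval_set)

# eval_set = [("how do I reset my password?", {"doc_481"}), ...]
# print(recall_at_k(eval_set, retrieve, k=5))
# print(recall_at_k(eval_set, retrieve, k=10))
```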
For Malaysian teams building retrieval systems with proper evaluation discipline, our AI Engineering programme covers embeddings, vector databases, RAG, and evaluation hands-on, and is HRDC SBL-KHAS claimable for eligible employers.