
Embeddings & Vector Databases: A Practitioner's Guide

What embeddings actually are, how vector databases work, and the practical decisions every team makes when building semantic search and RAG in 2026.

By Marcus Chia · 2025-11-11 · 9 min read

Most production AI systems in 2026 — anything with retrieval, semantic search, recommendations, or memory — are built on top of two things: embeddings and vector databases. Understanding them is non-negotiable for anyone building AI applications. Yet the foundational concepts get glossed over in most tutorials.

This is the working practitioner's view: what these things actually are, how to choose between options, and the implementation choices that actually matter.

1. What an embedding actually is

An embedding is a list of numbers — typically 384, 768, or 1,536 numbers long — that represents a piece of text in a way that captures its meaning. Texts with similar meanings produce embeddings that are close together in the multi-dimensional space those numbers describe.

Two crucial properties:

  • Semantic similarity becomes geometric distance. "A cat sat on the mat" and "The feline was on the rug" produce embeddings that are close together — even though they share almost no words.
  • The mapping is consistent within a model. Two embeddings produced by the same model can be compared. Embeddings from different models cannot be compared meaningfully.

Embeddings are produced by an embedding model. The popular options in 2026 include OpenAI's text-embedding-3 family, Cohere's embed-v4, Voyage AI's models (the embedding provider Anthropic recommends, since Anthropic does not ship its own), and a long tail of open-source models (Nomic, BGE, E5, multilingual variants). For most production systems, pick one and stick with it — switching embedding models requires re-embedding all your data.
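
To make the "similar meaning, close vectors" property concrete, here is a minimal sketch using OpenAI's Python SDK (the model name is one of the options above; an OPENAI_API_KEY in the environment is assumed):

```python
# Minimal sketch: embed two paraphrases and compare them.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["A cat sat on the mat", "The feline was on the rug"],
)
a, b = resp.data[0].embedding, resp.data[1].embedding

# Cosine similarity: dot product over the product of the magnitudes.
dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")  # high, despite no shared words
```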

2. How vector databases work

The embedding & retrieval lifecycle

1. Text in (document or query). Any text — a paragraph, a document chunk, a user query. The unit you want to find or be findable later.
2. Embed (embedding model). A model (OpenAI text-embedding-3, Cohere embed-v4, Voyage, BGE) converts the text into a vector of 384–1536 numbers.
3. Store (vector database). pgvector, Pinecone, Weaviate, Qdrant. The vector is indexed using HNSW or IVF for fast approximate nearest-neighbour search.
4. Search (ANN query). At query time, embed the query the same way, then ask the DB for the K vectors closest in cosine distance. Returns in milliseconds even at millions of vectors.
5. Retrieve (top-K results). Return the original text (or document IDs) corresponding to the closest vectors. This is what powers semantic search and the retrieval step in RAG.

A vector database is a specialised store for embeddings. Its core operation is nearest-neighbour search: given a query embedding, return the K stored embeddings closest to it. This is the operation behind every semantic search, every "find similar documents," and every RAG retrieval step.

The naive implementation — compute the distance to every stored embedding — is O(n) per query, and stops being viable somewhere in the hundreds of thousands of vectors, depending on dimensionality and latency budget. Vector databases use approximate nearest-neighbour (ANN) algorithms — HNSW, IVF, ScaNN — that trade a small amount of recall for orders-of-magnitude faster query time. In practice this is invisible: you query and get results in milliseconds.
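
For intuition, the sketch below is the naive O(n) scan an ANN index replaces, run on random unit vectors with NumPy. A real system swaps the matrix product for an HNSW or IVF index lookup:

```python
# Brute-force nearest-neighbour search: score every stored vector.
import numpy as np

rng = np.random.default_rng(0)
store = rng.normal(size=(100_000, 768)).astype(np.float32)  # stored embeddings
store /= np.linalg.norm(store, axis=1, keepdims=True)       # unit-normalise

query = rng.normal(size=768).astype(np.float32)
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity reduces to a dot product.
scores = store @ query
top_k = np.argsort(scores)[::-1][:5]  # indices of the 5 closest vectors
print(top_k, scores[top_k])
```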

3. Choosing a vector database in 2026

Vector database decision tree (2026)

1. pgvector. If you already run Postgres and have under 10M vectors. Keeps vector data alongside transactional data, single backup story, sufficient performance for most workloads. Strong default.
2. Managed (Pinecone, Weaviate Cloud, Qdrant Cloud). Tens of millions of vectors and counting; or you do not have a Postgres team; or you need specific features (multi-tenancy, advanced hybrid search, namespaces). Hosted simplicity over operational control.
3. Self-hosted specialised (Weaviate, Qdrant, Milvus). Extreme scale (hundreds of millions to billions of vectors); strict data residency rules forbidding managed services; or specialised research workloads. Wrong default for most enterprise applications.
4. Embedded (Chroma, LanceDB, FAISS). Local development, prototypes, or single-node deployments where you want zero infrastructure. Not for multi-tenant production.

Three categories of choice for production systems, depending on your situation. (Embedded stores like Chroma, LanceDB, and FAISS cover the local-development case, as the decision tree above notes.)

Use pgvector if you can

The Postgres extension. If your team already runs Postgres — and most do — adding pgvector keeps your vector data alongside your transactional data, with one set of credentials, one backup strategy, and one operational story. Performance is good enough for the vast majority of real-world workloads (under 10M vectors). The fewer specialised systems in your stack, the better.
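
A minimal sketch of that setup, using psycopg 3 and the pgvector Python package. The table name, column names, and connection string are illustrative assumptions, not a prescribed schema:

```python
# Semantic search on pgvector: create, index, query.
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=app", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teaches psycopg about the vector type

conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        body text NOT NULL,
        embedding vector(768)
    )
""")
# HNSW index so queries use approximate nearest-neighbour search
conn.execute(
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx "
    "ON chunks USING hnsw (embedding vector_cosine_ops)"
)

# <=> is pgvector's cosine-distance operator; smallest distance wins.
rows = conn.execute(
    "SELECT id, body FROM chunks ORDER BY embedding <=> %s LIMIT 5",
    (query_embedding,),  # 768 floats produced by your embedding step
).fetchall()
```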

Use a managed vector DB for scale or simplicity

Pinecone, Weaviate Cloud, Qdrant Cloud, Turbopuffer. Reasons to pick these: tens of millions of vectors and counting; you do not have a Postgres team; you want hosted simplicity over operational control; you need specific features (multi-tenancy, hybrid search, namespaces) that pgvector does not handle naturally.

Use a self-hosted specialised vector DB rarely

Self-hosting Weaviate, Qdrant, or Milvus. Reasons: extreme scale (hundreds of millions to billions of vectors); strict data residency requirements that forbid managed services; or specialised features required for research workloads. For most enterprise applications, this is the wrong default.

4. The implementation decisions that actually matter

Embedding model choice

Larger embeddings (1536+ dimensions) are not always better — they cost more to store and query. For most applications, a strong 768- or 1024-dimensional embedding model is the sweet spot. Test on your actual data with a holdout evaluation set; benchmark numbers from generic datasets often do not predict performance on your specific corpus.

Chunking strategy

You do not embed entire documents — you embed chunks. The decision of how to chunk is the single biggest determinant of retrieval quality. A 2000-character chunk with 200-character overlap is a reasonable default. For technical content, semantic chunking (splitting at topic boundaries detected by embedding similarity) typically outperforms fixed-size splitting. We dive into this further in our production RAG article.
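
As a reference point, the fixed-size default is a few lines; semantic chunking needs an embedding pass and is beyond a quick sketch:

```python
# Fixed-size chunking with overlap: 2,000 characters per chunk,
# 200-character overlap between neighbouring chunks.
def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    step = size - overlap
    # Stop before len(text) - overlap so the final chunk is not pure overlap.
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text(open("doc.txt").read())
```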

Hybrid search

Pure vector search (semantic similarity) misses queries that hinge on specific terminology or proper nouns. Hybrid search combines vector search with traditional keyword search (BM25 or TF-IDF) and merges results using reciprocal rank fusion. The recall improvement is consistently meaningful and worth the implementation cost.
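
Reciprocal rank fusion itself is short. A sketch, with k = 60 as the conventional constant; the input rankings are assumed to be lists of document IDs, best first:

```python
# Reciprocal rank fusion: each document scores sum(1 / (k + rank))
# across every ranking it appears in, then results sort by score.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One list from vector search, one from BM25:
print(rrf([["a", "b", "c"], ["b", "d", "a"]]))  # -> ['b', 'a', 'd', 'c']
```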

Metadata filtering

Real applications need to filter results — by user, tenant, date range, document category — alongside semantic search. Vector databases handle this very differently. pgvector lets you use any SQL WHERE clause; managed vector DBs typically have their own metadata schemas. Decide your filtering needs upfront; retrofitting is painful.
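
In pgvector, filtering is ordinary SQL. A sketch continuing the hypothetical schema from section 3, with illustrative tenant_id and created_at columns added:

```python
# Metadata filters are plain WHERE predicates; the vector ORDER BY is unchanged.
rows = conn.execute(
    """
    SELECT id, body
    FROM chunks
    WHERE tenant_id = %s AND created_at >= %s
    ORDER BY embedding <=> %s
    LIMIT 5
    """,
    (tenant_id, since, query_embedding),  # from your app and embed step
).fetchall()
```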

5. Operational gotchas

  • Re-embedding is expensive. If you switch embedding models, you must re-embed everything. Build for this from day one — keep ingestion idempotent and versioned.
  • Stale data is invisible. Unlike a database query that fails when data is missing, a vector search returns whatever is closest — even if your index is months out of date. Build refresh pipelines and monitor recency.
  • Cost scales with vectors stored, not just queries. Plan capacity. A million vectors at 1536 dimensions in float32 is about 6 GB before any indexing overhead (see the quick arithmetic after this list).
  • Quality is not a single number. Recall, precision, latency, and cost trade off against each other. Pick the optimisation target your application actually cares about, then measure it.
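
That storage figure is simple arithmetic:

```python
# Back-of-envelope storage for raw vectors, before index overhead.
n_vectors, dims, bytes_per_float = 1_000_000, 1536, 4  # float32
raw_bytes = n_vectors * dims * bytes_per_float
print(f"{raw_bytes / 1e9:.1f} GB")  # -> 6.1 GB
```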

6. The minimum viable evaluation

Before you ship any retrieval-based system, build a small evaluation set: 50–100 query-correct-answer pairs. Run your retrieval pipeline on it. Measure recall@5 and recall@10. This is the difference between "the demo works" and "the system works." The discipline costs a day and saves quarters of debugging.
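
A sketch of that measurement; eval_set and retrieve are placeholders for your hand-curated pairs and your pipeline's top-K search function, not a prescribed API:

```python
# Recall@K: the fraction of evaluation queries whose known-correct
# document appears in the top K retrieved results.
def recall_at_k(eval_set, retrieve, k: int) -> float:
    """eval_set: (query, correct_doc_id) pairs; retrieve(query, k) -> top-K IDs."""
    hits = sum(1 for query, doc_id in eval_set if doc_id in retrieve(query, k))
    return hits / len(eval_set)

# Usage with your own pieces:
#   print(recall_at_k(eval_set, my_pipeline_search, k=5))
#   print(recall_at_k(eval_set, my_pipeline_search, k=10))
```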

For Malaysian teams building retrieval systems with proper evaluation discipline, our AI Engineering programme covers embeddings, vector databases, RAG, and evaluation hands-on, and is HRDC SBL-KHAS claimable for eligible employers.

About the author

Marcus Chia

12+ yrs Product Design · Vibe Coding Specialist · ASEAN-scale Products

Marcus has 12+ years in product design and front-end engineering, having shipped consumer and SaaS products used by millions across ASEAN. He specialises in vibe-coding workflows that turn Figma concepts into deployable apps using Claude Code, Antigravity, and Cursor — and teaches non-developers to ship polished, user-centric interfaces in days rather than sprints.


Frequently Asked Questions

What is the difference between an embedding and a vector?

They are the same thing in the AI context. An embedding IS a vector — a list of numbers — produced by an embedding model to represent text (or images, audio, etc.) in a way that captures meaning. The terminology overlaps; 'embedding' emphasises the meaning-capture purpose, 'vector' emphasises the mathematical structure.

Should I use pgvector or a dedicated vector database?

pgvector for the vast majority of cases — under 10M vectors, existing Postgres infrastructure, mixed transactional and vector workloads. Switch to a specialised managed service (Pinecone, Weaviate Cloud, Qdrant Cloud) when you cross tens of millions of vectors, when you need specific features pgvector does not handle (advanced multi-tenancy, complex hybrid search), or when you do not want to operate Postgres yourself.

Why is my RAG system retrieving the wrong results?

Most RAG retrieval failures trace to the chunking strategy or the embedding model, not the database. Common causes: chunks too large (loses precision), chunks too small (loses context), pure vector search missing terminology-driven queries (use hybrid search), or stale data (build refresh pipelines). The retrieval layer is where 80 percent of RAG failures occur — invest evaluation effort there first.

How much does it cost to embed a large corpus?

OpenAI's text-embedding-3-small costs roughly USD 0.02 per million tokens in 2026. For a corpus of 1 million documents averaging 2,000 tokens each, the embedding cost is around USD 40. Re-embedding for a model upgrade is the same cost, applied to the same corpus. Storage costs are typically larger than embedding costs at scale.

Is there HRDC-claimable training that covers this?

Yes. AITraining2U's AI Engineering programme — covering embeddings, vector databases (pgvector, Pinecone, Weaviate), RAG architecture, and production evaluation — is HRDC SBL-KHAS claimable for eligible Malaysian employers.

Want to apply this in your organisation?

AITraining2U runs HRDC-claimable corporate AI training for Malaysian organisations — from leadership awareness to hands-on builder workshops. Talk to us about a programme tailored to your team.