Choosing a vector database: pgvector vs Pinecone vs Qdrant

An honest comparison of the three serious choices for production vector search in 2026: what each one is good at, where each falls short, and why pgvector wins more often than the marketing suggests.

YAEL Engineering · 24 Nov 2025 · 9 min read · 1,742 words

For ~80% of production RAG systems in 2026, pgvector running on the Postgres you already have is the right answer. It is fast enough up to tens of millions of vectors, it supports hybrid (vector + keyword) search natively, it joins to your relational data without a separate sync pipeline, and it costs you nothing beyond Postgres hosting. Pinecone is the right answer if you need >100M vectors with sub-100ms latency or if you don't want to run a database at all. Qdrant is the right answer if you need a self-hosted, feature-rich vector DB independent of your relational store. Weaviate, Milvus, and Chroma are all viable in their niches. The question isn't usually "which is best." It's "which is best for the size of the problem I actually have."

We've shipped RAG on each of these. This is the honest take.

The decision in one paragraph

If you already have Postgres and your corpus is under ~10M vectors: pgvector. If you have over 100M vectors or want a fully managed service: Pinecone (or, for the managed case at smaller scale, pgvector on a managed Postgres such as Neon). If you need self-hosted with rich filtering and don't want to use Postgres: Qdrant. If you have specific exotic needs (graph search, image+text hybrid): Weaviate. If you want the simplest possible local option: Chroma or LanceDB.

The feature comparison that matters

| | pgvector | Pinecone | Qdrant |
|---|---|---|---|
| Hosting | Self-host on Postgres or managed (Neon, Supabase) | Fully managed only | Self-host or managed cloud |
| Max practical vectors | ~10-100M with HNSW | Effectively unlimited | ~100M self-hosted, more managed |
| Filtering | Full SQL on relational columns | Metadata filters, limited | Rich payload filters |
| Hybrid search | tsvector native; fuse scores in SQL | None (use a wrapper) | BM25 native |
| Joins to relational data | Free | Manual sync | Manual sync |
| Cost (low volume) | $0 over Postgres | ~$70/mo minimum | $0 self-hosted |
| Cost (high volume) | Postgres scaling | Linear with vectors | Linear with infra |
| Latency (P95) | ~10-50ms for under 10M vectors | ~30-80ms | ~10-40ms |
| Index types | HNSW, IVFFlat | HNSW (managed) | HNSW |
| Operational complexity | Whatever Postgres needs | None | Real |

Why pgvector wins more often than you'd think

The dominant argument: you already have a relational database. Your documents are stored relationally. Your tenant model is relational. Your auth is relational. Adding a separate vector DB means a sync pipeline — chunks get written to two places, and they get out of sync, and the bug is invisible until a user notices.

With pgvector, the embedding is a column:

sql

create extension if not exists vector;

create table documents (
  id          text primary key,
  org_id      text not null,
  title       text not null,
  body        text not null,
  embedding   vector(1536),
  created_at  timestamptz not null default now()
);

-- HNSW index for fast approximate nearest neighbor
create index documents_embedding_idx
  on documents using hnsw (embedding vector_cosine_ops);

-- Standard btree for tenant filtering
create index documents_org_idx on documents(org_id);

A tenant-scoped search becomes one query:

sql

select id, title,
       1 - (embedding <=> $1) as score
from documents
where org_id = $2
order by embedding <=> $1
limit 8;

That's the entire vector search. With Postgres row-level security (RLS) enabled for your tenants, even the where org_id = $2 filter is enforced at the DB level. No sync pipeline. No second source of truth. No cross-system race conditions.
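For readers new to pgvector: `<=>` is its cosine distance operator, so `1 - (embedding <=> $1)` is cosine similarity. Here is a minimal TypeScript sketch of the same arithmetic, handy for sanity-checking scores in unit tests. The function names are ours for illustration, not part of any library.

```typescript
// Cosine distance as used by pgvector's <=> operator: 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors the `score` column in the SQL: 1 = same direction, 0 = orthogonal.
function score(queryVec: number[], docVec: number[]): number {
  return 1 - cosineDistance(queryVec, docVec);
}
```

Distance ranges from 0 to 2, so the score ranges from 1 down to -1. Ordering by distance ascending and ordering by score descending give the same result.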

When pgvector hits its limit

The real ceiling is around 10-30M vectors per HNSW index, depending on your hardware and query patterns. Past that:

HNSW build time gets long (hours, not minutes)
Memory usage of the index becomes significant
Query latency starts to climb past 100ms

You have three options at that scale:

Shard by tenant (each tenant gets its own table)
Use IVFFlat instead of HNSW (different trade-offs — faster build, slightly slower queries, more tuneable)
Move to a dedicated vector DB
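If you try the IVFFlat route, the two knobs are lists (set at index build) and probes (set per session). The sketch below encodes the starting points suggested in the pgvector README: rows / 1000 lists up to 1M rows, sqrt(rows) above that, and sqrt(lists) probes. Treat the output as a first guess to tune against your own recall tests. The function names and the index name are our own.

```typescript
// Starting values only; higher probes improves recall at the cost of speed.
function ivfflatStartingParams(rowCount: number): { lists: number; probes: number } {
  const lists =
    rowCount <= 1000000
      ? Math.max(1, Math.round(rowCount / 1000))
      : Math.round(Math.sqrt(rowCount));
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}

// Renders the index DDL and the session setting for the documents table.
// documents_embedding_ivf_idx is a made-up index name.
function ivfflatSql(rowCount: number): string[] {
  const { lists, probes } = ivfflatStartingParams(rowCount);
  return [
    `create index documents_embedding_ivf_idx on documents using ivfflat (embedding vector_cosine_ops) with (lists = ${lists});`,
    `set ivfflat.probes = ${probes};`,
  ];
}
```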

Sharding by tenant is surprisingly effective because most queries are already tenant-scoped — you're searching within one tenant at a time anyway.
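A rough sketch of what tenant routing can look like in application code, assuming a hypothetical naming scheme of one table per tenant (documents_acme, documents_globex, and so on). Table names cannot be bound as query parameters, so the org id has to be validated before it is interpolated.

```typescript
// Hypothetical scheme: each tenant's chunks live in documents_<orgId>.
function tenantTable(orgId: string): string {
  // Allow only lowercase letters, digits and underscores; reject anything else.
  if (!/^[a-z0-9_]{1,40}$/.test(orgId)) {
    throw new Error(`unsafe org id for a table name: ${orgId}`);
  }
  return `documents_${orgId}`;
}

// Same search as before, minus the org_id filter: the table is the tenant scope.
function tenantSearchSql(orgId: string): string {
  const table = tenantTable(orgId);
  return `select id, title, 1 - (embedding <=> $1) as score
from ${table}
order by embedding <=> $1
limit 8`;
}
```

Each tenant table carries its own, much smaller HNSW index. Postgres declarative partitioning on org_id is another way to get a similar layout without routing in the application.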

When Pinecone makes sense

Pinecone's pitch: it's the easiest production vector DB to operate. You don't run a database. You make API calls. The latency is consistent. The scaling is automatic.

Pick Pinecone when:

You have a non-engineering team building RAG and don't want to think about infra
Your corpus is genuinely huge (100M+ vectors)
You need cross-region replication out of the box
You can afford $70-700+/month in Pinecone fees on top of your other infra

Pinecone's filtering is weaker than Postgres or Qdrant. Metadata filters work but they're constrained — no joins, limited operators, no nested filtering. If your search needs are "vector similarity plus some metadata exact-match," fine. If they're more complex, Pinecone gets awkward.

When Qdrant makes sense

Qdrant is the self-hosted middle ground. Rich payload filtering, HNSW-only (good for most cases), Rust-based and fast.

Pick Qdrant when:

You can't or don't want to use Postgres
You need rich filter expressions on metadata
You have a self-hosted ops culture and the team to run it
You want sub-50ms P99 latency at 50M+ vectors

The trade-off: it's a separate database. You have a sync pipeline again. You have separate backup and disaster-recovery operations. None of this is unique to Qdrant — same applies to Pinecone, Weaviate, and Milvus.

Hybrid search — the under-appreciated win

Pure dense-vector search misses keyword matches that humans would expect. "Find docs about kubernetes" returns docs that semantically match "container orchestration" — and the user wonders why we didn't return the doc literally titled "Kubernetes."

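Hybrid search combines keyword ranking (BM25 or similar) with dense vectors. Postgres covers both in one query using its built-in full-text ranking (ts_rank, which approximates BM25 rather than implementing it):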

sql

select
  id, title,
  (
    0.5 * (1 - (embedding <=> $1))  -- vector similarity
    + 0.5 * ts_rank(to_tsvector('english', body), plainto_tsquery('english', $2), 32)  -- keyword rank, scaled to 0..1
  ) as score
from documents
where org_id = $3
order by score desc
limit 8;

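This is a weighted-sum hybrid: it blends the two scores directly. Reciprocal rank fusion (RRF), which combines rank positions instead of raw scores, is the usual alternative when the two score scales are hard to compare. Hybrid is better than either signal alone by 5-15% on quality benchmarks. Worth implementing.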

Qdrant has native BM25 support. Pinecone doesn't — you'd run two queries and fuse them application-side.
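When you do have to fuse two result lists in the application, reciprocal rank fusion is the usual tool: each list contributes 1 / (k + rank) for every id it returns, and the sums decide the final order. A minimal sketch, with made-up document ids and the commonly used k = 60 as a default you can tune:

```typescript
// Reciprocal rank fusion over any number of ranked id lists (best first).
function reciprocalRankFusion(
  rankedLists: string[][],
  k = 60,
  limit = 8,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Illustrative ids: the dense query ranks the "orchestration" doc first,
// the keyword query ranks the literal "kubernetes" doc first.
const fused = reciprocalRankFusion([
  ["doc-orchestration", "doc-kubernetes", "doc-docker"],
  ["doc-kubernetes", "doc-helm"],
]);
```

Here doc-kubernetes comes out on top because it appears in both lists. RRF only needs ranks, so it works even when the two systems return scores on incompatible scales.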

Embedding model choices

The vector DB matters less than the embedding model. Choices:

OpenAI text-embedding-3-large (3072 dim) — strong baseline, expensive at scale
OpenAI text-embedding-3-small (1536 dim) — cheaper, ~95% of large's quality on most tasks
Cohere Embed v3 — competitive quality, often cheaper
BGE M3 / e5-large-v2 — self-hosted, near-OpenAI quality, free if you have the GPU
Voyage AI — niche but excellent for code embeddings

Storage cost scales linearly with dimension. A 3072-dim vector at float32 is 12KB. A million vectors is 12GB just in the embedding column. Halve the dimension and you halve the storage.

sql

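-- Use a smaller embedding to save storage.
-- Add the new column first so search keeps working during the backfill.
alter table documents add column embedding_768 vector(768);
-- (re-embed with a smaller model into embedding_768, then index it)
-- Dropping the old column also drops documents_embedding_idx.
alter table documents drop column embedding;
alter table documents rename column embedding_768 to embedding;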
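The storage arithmetic above is easy to reproduce for your own corpus. This sketch counts only the raw embedding bytes and ignores row and index overhead. The function name is ours, and the bytes-per-dimension default assumes float32.

```typescript
// Raw embedding storage in bytes.
// bytesPerDim: 4 for float32, 2 for half precision.
function embeddingStorageBytes(dims: number, vectorCount: number, bytesPerDim = 4): number {
  return dims * bytesPerDim * vectorCount;
}

const GB = 1e9;
const largeGb = embeddingStorageBytes(3072, 1000000) / GB; // 12.288
const smallGb = embeddingStorageBytes(1536, 1000000) / GB; // 6.144
const compactGb = embeddingStorageBytes(768, 1000000) / GB; // 3.072
```

An HNSW index adds its own footprint on top of the column, so budget for more than these numbers in practice.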

Reranking — the highest-leverage step

Vector search returns top-K candidates. A cross-encoder reranker (Cohere Rerank, BGE reranker) re-orders them by true relevance. This typically improves retrieval quality more than upgrading the embedding model.

Reranking is independent of which vector DB you use:

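// Assumes three things defined elsewhere: embed() returning number[],
// a db.query() wrapper that resolves to an array of rows, and a Cohere
// client from the cohere-ai SDK. The rerank model name is only an example.
async function searchAndRerank(query: string, orgId: string) {
  const queryVec = await embed(query);
  // pgvector accepts the '[1,2,3]' text form, which JSON.stringify produces
  const candidates = await db.query<DocumentRow[]>(
    `select id, title, body from documents
     where org_id = $1
     order by embedding <=> $2::vector
     limit 50`,
    [orgId, JSON.stringify(queryVec)],
  );
  if (candidates.length === 0) return [];
  // Rerank with Cohere: expensive but very effective
  const reranked = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: candidates.map((c) => c.body),
    topN: 8, // the TypeScript SDK uses camelCase; top_n is the REST/Python name
  });
  return reranked.results.map((r) => candidates[r.index]);
}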

Top-50 → rerank → top-8. The combined cost is ~$0.001 per query at Cohere's rates. The quality gain is dramatic.

Operational notes

A few things we've learned shipping these in production:

HNSW builds are slow. Build the index once after bulk-loading; don't rebuild on every insert. Postgres handles incremental updates fine.
IVFFlat needs data first. The index learns its centroids from the rows present when you create it, so load representative data before building it, and rebuild if the data distribution shifts.
Embedding cost dominates. At high volume, paying OpenAI for embeddings can exceed the cost of running Postgres. Plan for it.
Drift. If you re-embed your corpus with a new model, you must re-embed queries with the same model. Cross-model search returns garbage. Add a model-version column to your documents table.

sql

alter table documents add column embedding_model text default 'text-embedding-3-small';
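With that column in place, the search can refuse to compare vectors from different models. A small sketch of a query builder that does this, assuming the documents schema from this article. The `{ text, values }` shape matches what node-postgres accepts, but any client works, and the function name is our own.

```typescript
interface SearchQuery {
  text: string;
  values: [string, string, string];
}

// Tenant-scoped search that only matches rows embedded with the same model
// that produced the query vector.
function buildSearchQuery(queryVec: number[], orgId: string, queryModel: string): SearchQuery {
  return {
    text: `select id, title, 1 - (embedding <=> $1::vector) as score
from documents
where org_id = $2
  and embedding_model = $3
order by embedding <=> $1::vector
limit 8`,
    // pgvector parses the '[0.1,0.2]' text form that JSON.stringify emits.
    values: [JSON.stringify(queryVec), orgId, queryModel],
  };
}
```

During a re-embedding migration this makes partially migrated rows drop out of results instead of polluting them, which is usually the safer failure mode.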


FAQ

Can pgvector handle 100M vectors?

Technically yes, practically painful. Past ~30M per index, HNSW build times get long and memory pressure builds. Shard by tenant or move to a dedicated vector DB at that scale.

What about Weaviate / Milvus / Chroma?

Weaviate has strong filtering and a built-in vectorization pipeline — useful if you want one tool to do embedding + storage. Milvus is high-scale enterprise. Chroma is a delightful local-first tool that's not designed for high-volume production.

Does Supabase support pgvector?

Yes, natively. Same for Neon. Both are good hosted-pgvector options.

What about full-text + vector in one query?

Postgres does it natively via the hybrid pattern above. Most other vector DBs require either two queries (then app-side fusion) or a separate full-text store.

How many dimensions do I need?

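For modern embedding models, 1024-1536 dims is enough for most domains. Drop to 768 if storage matters, push to 3072 if quality matters and you can afford the cost. One pgvector caveat: HNSW and IVFFlat indexes on the vector type stop at 2,000 dimensions, so indexing 3072-dim embeddings means storing them as halfvec (indexable up to 4,000 dimensions) or shortening them.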

Can I use binary or scalar quantization?

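Yes. pgvector 0.7.0 and later support half-precision (halfvec) and binary (bit) vectors. Half precision halves storage versus float32 with modest quality loss; binary quantization cuts it 32x at a larger quality cost, which you can largely recover by re-ranking the top candidates against the full-precision vectors. Worth experimenting on read-heavy systems.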
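To make the binary case concrete, here is a sketch of sign-based binary quantization and the Hamming distance used to compare the results. It illustrates the idea in plain TypeScript and is not pgvector's implementation. The function names are ours.

```typescript
// One bit per dimension: 1 if the value is positive, 0 otherwise.
function binaryQuantize(vec: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(vec.length / 8));
  vec.forEach((value, i) => {
    if (value > 0) bits[i >> 3] |= 1 << (7 - (i & 7));
  });
  return bits;
}

// Hamming distance: how many bit positions differ between two quantized vectors.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) {
      distance += x & 1;
      x >>= 1;
    }
  }
  return distance;
}

// Storage for a 1536-dim vector: float32 versus one bit per dimension.
const float32Bytes = 1536 * 4; // 6144
const binaryBytes = Math.ceil(1536 / 8); // 192
```

The 32x saving falls straight out of the last two lines. The usual pattern is to fetch a generous candidate set by Hamming distance, then re-rank those candidates with the full-precision vectors.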

Should I store the original text alongside the vector?

Yes, in the same row. Querying the vector then needing to round-trip to fetch the text from another store is the pattern that pushes most teams off Pinecone.

How do I update an embedding when the document changes?

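Re-embed the new text, update the row. With pgvector, it's a single update in the same transaction as the text change. With a separate vector DB, you upsert the vector by ID in a second system and have to handle the case where one write succeeds and the other fails. Either way, plan for it: drift between text and vector is the silent killer of RAG quality.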
