By Rohit Gupta — 05 Jun 2026

The Vector Store Decision: Aligning Embedding Infrastructure with Operational Reality

Storing embeddings has become a routine step in enabling semantic search and feeding context to LLM applications. The vector database chosen to index those embeddings dictates query latency, recall, and the operational burden on the backend team. Three options recur in production stacks: pgvector, an extension that turns an existing Postgres instance into a vector store; Milvus, a dedicated distributed vector search system; and Pinecone, a managed API that abstracts away the infrastructure. The choice between them determines how tightly the vector tier couples to transactional data, who owns index uptime, and how the system handles growth.

Engineering Coupling and Ownership: Managed Services Don’t Automatically Simplify

Pgvector embeds the vector store inside the relational database that likely already holds product catalogs, user profiles, or document metadata. That coupling eliminates network hops between transactional and vector queries, allowing single-statement joins and ACID guarantees across both. The trade-off is that the team must manage index builds and memory allocation on the same database that serves OLTP traffic. Milvus decouples the vector index into its own service, introducing a separate deployment, SDK, and persistence layer. Ownership shifts: the team now runs a distributed system that demands familiarity with segment compaction, index parameter tuning, and its coordination dependencies (etcd for metadata, and a message queue such as Pulsar for the write log). Pinecone removes infrastructure ownership entirely, but it also removes direct access to the index internals. A backend group that cannot tune graph parameters of the kind an HNSW index exposes may ship a system whose recall quietly degrades on high-dimensional data, with no raw access to logs or index segments to diagnose the drift. The decision hinges on whether the team can afford to own additional stateful components or can accept a black box optimized for common patterns.
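To make the "single-statement join" point concrete, here is a minimal sketch of a query that filters on relational data and ranks by vector distance in one statement. The `documents` and `tenants` tables and their columns are hypothetical, and the helper assumes a DB-API cursor from a driver that uses `%s` placeholders (psycopg, for example). The `<=>` operator is pgvector's cosine-distance operator.

```python
from typing import Any, List, Sequence, Tuple

# Hypothetical schema: tenants(id, active), documents(id, tenant_id, title, embedding).
# One statement joins relational rows and orders them by cosine distance (<=>).
NEAREST_DOCS_SQL = """
SELECT d.id, d.title, d.embedding <=> %s::vector AS distance
FROM documents AS d
JOIN tenants AS t ON t.id = d.tenant_id
WHERE t.id = %s AND t.active
ORDER BY d.embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(values: Sequence[float]) -> str:
    """Format numbers in pgvector's text input form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def nearest_docs(cursor: Any, tenant_id: int,
                 query_embedding: Sequence[float], limit: int = 5) -> List[Tuple]:
    """Run the joined nearest-neighbor query on an existing DB-API cursor."""
    literal = to_vector_literal(query_embedding)
    # The literal is passed twice: once for the SELECT list, once for ORDER BY.
    cursor.execute(NEAREST_DOCS_SQL, (literal, tenant_id, literal, limit))
    return cursor.fetchall()
```

With a dedicated vector service, the same result typically takes two round trips: a vector search that returns IDs, then a relational lookup to filter and enrich them.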

Placement Driven by Transactional Boundaries and Throughput Profiles

When a new document insert must simultaneously update a full-text search index, a structured column, and a vector embedding, pgvector keeps all writes inside a single transaction. That property matters in regulated audit trails or order management systems where a partial vector update would leave the search index inconsistent. Milvus fits workloads that need high write throughput (tens of thousands of inserts per second or more) and sub-50-millisecond approximate-nearest-neighbor queries on billion-scale datasets, because its LSM-style segment design lets it buffer, build, and merge index segments without blocking ingestion. Pinecone suits teams with spiky, unpredictable query loads and no appetite for index maintenance; its serverless architecture absorbs bursts, but the cost model punishes sustained high throughput. The system property that governs the choice is not vector count alone, but how tightly the write path is coupled to other storage operations and what query latency distribution the application can tolerate under load.
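The single-transaction write path described above can be sketched roughly as follows. The table and column names (`documents`, `body_tsv`, `audit_log`) are hypothetical, and the function assumes a DB-API connection with autocommit off and `%s` placeholders, as with psycopg. It is an illustration of the pattern, not a reference implementation.

```python
from typing import Any, Sequence


def insert_document(conn: Any, doc_id: int, body: str,
                    embedding: Sequence[float]) -> None:
    """Write text, full-text vector, embedding, and audit row atomically."""
    literal = "[" + ",".join(repr(float(v)) for v in embedding) + "]"
    cur = conn.cursor()
    try:
        # Structured column, tsvector for full-text search, and the embedding
        # all land in the same row within the same transaction.
        cur.execute(
            "INSERT INTO documents (id, body, body_tsv, embedding) "
            "VALUES (%s, %s, to_tsvector('english', %s), %s::vector)",
            (doc_id, body, body, literal),
        )
        cur.execute(
            "INSERT INTO audit_log (doc_id, action) VALUES (%s, 'insert')",
            (doc_id,),
        )
        conn.commit()
    except Exception:
        # Any failure undoes every write, so no partial vector update survives.
        conn.rollback()
        raise
    finally:
        cur.close()
```

With a separate vector service, the equivalent guarantee usually requires an outbox table or a retry-and-reconcile job, because the two stores cannot share one commit.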

Applied Patterns: Postgres-adjacent Embeddings and High-Churn Semantic Pipelines

A B2B SaaS platform storing client documents, invoices, and embeddings in Postgres can run a single-table schema where pgvector’s IVFFlat index provides acceptable recall for iterative document search, with no additional infrastructure to monitor. A product that surfaces real-time item recommendations from a stream of user interaction events calls for Milvus. The architecture ingests Kafka events, generates embeddings via a sidecar or inference service, and inserts into Milvus with the HNSW index tuned for a 99th-percentile latency of 30 milliseconds on 50 million vectors. A three-person startup building an LLM agent that summarizes customer tickets and retrieves historical context opts for Pinecone. They never touch an index parameter, never provision pods, and accept that the system’s recall may plateau at 97 percent because they cannot directly tune the index’s efConstruction value for their specific embedding model.
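The "acceptable recall" of an IVF-style index in the first pattern depends on how many lists a query probes. The toy below is a pure-Python illustration of that idea, not pgvector's implementation: it clusters random vectors into lists with a few k-means passes, searches only the nearest `probes` lists, and compares the result with an exact scan. The data, sizes, and function names are all made up for the example; in pgvector the corresponding knobs are the `lists` index option and the `ivfflat.probes` setting.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Put each vector index into the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(i)
    return lists


def build_ivf(vectors, n_lists, iters=5, seed=0):
    """A few k-means passes, then a final assignment against the last centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # keep the old centroid if a list ends up empty
                cols = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in cols]
    return centroids, assign(vectors, centroids)


def search(vectors, centroids, lists, query, k, probes):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:probes] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


def recall_at_k(vectors, centroids, lists, queries, k, probes):
    hits = 0
    for q in queries:
        exact = set(sorted(range(len(vectors)), key=lambda i: sq_dist(q, vectors[i]))[:k])
        hits += len(exact & set(search(vectors, centroids, lists, q, k, probes)))
    return hits / (k * len(queries))


rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
centroids, lists = build_ivf(data, n_lists=20)
recalls = {p: recall_at_k(data, centroids, lists, queries, k=10, probes=p)
           for p in (1, 5, 20)}
for probes, recall in recalls.items():
    print(f"probes={probes:2d} recall@10={recall:.3f}")
```

Probing every list is equivalent to an exact scan, so recall reaches 1.0 there; fewer probes trade recall for less work per query. A managed service that hides the equivalent parameter fixes that trade-off on the team's behalf.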

Operational Friction, Index Tuning, and Cost Realities

Pgvector’s HNSW index is updated incrementally on inserts, but its memory footprint grows with the graph degree parameter (m), and index builds and searches compete with Postgres shared buffers for RAM, which can directly affect transactional performance. The IVFFlat index partitions vectors into lists around centroids computed when the index is built; if the data distribution shifts and the index is not rebuilt, recall decays. Milvus automates index building and segment merging, but operators must allocate enough memory for both the index and the growing segment buffer during compaction, and misconfiguring the data coordinator can fragment metadata in etcd. Pinecone eliminates that surface area and charges per pod or per million read units; a high-dimensional embedding workload (3,072 dimensions, for example) at scale often exposes a steep cost curve compared with self-hosted components when query volume is constant. Teams that abandon self-managed options early lose the ability to retune index parameters when the embedding model changes, a non-obvious friction point when migrating from one embedding model to another.
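A back-of-the-envelope calculation shows why dimension count and graph degree drive memory and cost. The sketch below assumes 4-byte float32 components and, as a rough approximation taken from the general HNSW design rather than from any specific engine, about 2 × m neighbor links of 4 bytes each per vector on the base layer. Upper layers, page overhead, and engine-specific layouts are ignored, and the one-million-vector workload is hypothetical.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(n_vectors: int, dims: int) -> int:
    """Bytes needed to hold the vectors themselves as float32."""
    return n_vectors * dims * BYTES_PER_FLOAT32


def hnsw_link_bytes(n_vectors: int, m: int, bytes_per_link: int = 4) -> int:
    """Rough base-layer graph size: about 2*m links per vector (assumption)."""
    return n_vectors * 2 * m * bytes_per_link


# Hypothetical workload: one million 3,072-dimensional embeddings, m = 16.
n, dims, m = 1_000_000, 3072, 16
vectors_gb = raw_vector_bytes(n, dims) / 1e9
links_gb = hnsw_link_bytes(n, m) / 1e9
print(f"per vector: {dims * BYTES_PER_FLOAT32} bytes")
print(f"vectors: {vectors_gb:.3f} GB, graph links: {links_gb:.3f} GB")
```

Under these assumptions the vectors alone take about 12.3 GB per million rows, so at 3,072 dimensions the raw data dominates the graph links by roughly two orders of magnitude, which is why dimensionality tends to drive both the memory pressure on a shared Postgres instance and the bill on a usage-priced service.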

TL;DR

Storing embeddings is simple; indexing them for low-latency, high-recall semantic search requires decisions about data coupling and runtime ownership.
Pgvector couples vector indexes to Postgres, keeping transactional guarantees but competing for memory and I/O with OLTP workloads.
Milvus decouples the vector index into a distributed system that delivers high throughput and configurable indexing strategies at the cost of operational complexity.
Pinecone removes infrastructure management but obscures index internals, making recall tuning and cost control opaque.
The correct option depends on whether embeddings must share a transaction boundary, the acceptable latency distribution under write load, and the team’s capacity to operate stateful search infrastructure.

For backend engineering services covering embedding pipelines, vector database selection, and LLM infrastructure integration, contact BaseStation Private Limited at [email protected].
