PostgreSQL at Scale: What I've Learned Running It in Production

engineering · databases

I've run PostgreSQL in production for over a decade across healthcare eligibility platforms, financial systems, and migration engines. Some of these systems process hundreds of millions of records. Here's what I've learned about running Postgres at scale — the stuff that's hard to find in the documentation.

Indexing Is Your Job, Not Postgres's

PostgreSQL's query planner is excellent, but it can only work with the indexes you give it. The number one performance issue I see in production Postgres databases is missing or poorly designed indexes.

Rules I follow (a short SQL sketch follows the list):

  • Index every foreign key. PostgreSQL doesn't do this automatically (unlike some databases). Missing FK indexes cause devastating slowdowns on joins and on cascading deletes and updates.
  • Use partial indexes. If you frequently query WHERE status = 'active' and only 10% of rows are active, a partial index on status WHERE status = 'active' is dramatically smaller and faster than a full index.
  • Composite indexes: column order matters. Order the columns to match your queries, with equality-filtered columns first and range columns after; the index only helps when its leading column appears in the WHERE clause. A composite index on (status, created_at) is useless for queries that only filter on created_at.
  • Monitor unused indexes. Every index slows down writes. Use pg_stat_user_indexes to find indexes that aren't being used and drop them.
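
Here's a minimal SQL sketch of those rules. The table and column names (orders, customer_id, status, created_at) are placeholders, not from any system described here:

    -- Foreign keys are not indexed automatically; create the index yourself
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);

    -- Partial index: only the small fraction of active rows is indexed
    CREATE INDEX idx_orders_active ON orders (created_at) WHERE status = 'active';

    -- Composite index: serves filters on status, or status plus created_at,
    -- but not created_at alone
    CREATE INDEX idx_orders_status_created ON orders (status, created_at);

    -- Find indexes that have never been scanned (drop candidates)
    SELECT schemaname, relname, indexrelname, idx_scan,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;

One caveat on the last query: the counters in pg_stat_user_indexes only cover activity since statistics were last reset, so make sure the window is long enough before you drop anything.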

VACUUM Is Not Optional

Postgres uses MVCC (Multi-Version Concurrency Control), which means old row versions accumulate over time. VACUUM cleans them up. If VACUUM falls behind, your tables bloat, your queries slow down, and eventually you hit transaction ID wraparound — which is as bad as it sounds.
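
A quick way to see how far each database is from wraparound territory, sketched here as a starting point:

    -- Age of the oldest unfrozen transaction ID per database; autovacuum forces
    -- an anti-wraparound freeze as this approaches autovacuum_freeze_max_age
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;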

For high-throughput systems:

  • Tune autovacuum aggressively. The defaults are conservative. For tables with high write volume, I lower autovacuum_vacuum_scale_factor (and often autovacuum_vacuum_threshold) so vacuum kicks in after far fewer dead rows accumulate (see the sketch after this list).
  • Monitor dead tuples. If n_dead_tup in pg_stat_user_tables is consistently growing, autovacuum isn't keeping up.
  • Consider table partitioning. For time-series data or tables that grow unboundedly, partitioning makes VACUUM more efficient because it operates on smaller chunks.
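
Here's roughly what that tuning looks like; events is a stand-in table name and the numbers are starting points to adjust, not recommendations:

    -- Per-table autovacuum settings for a high-write table
    ALTER TABLE events SET (
        autovacuum_vacuum_scale_factor = 0.01,   -- vacuum once ~1% of rows are dead
        autovacuum_vacuum_threshold    = 1000
    );

    -- Is autovacuum keeping up? Watch tables where n_dead_tup keeps climbing
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;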

Connection Pooling Is Mandatory

PostgreSQL creates a new process for every connection. At scale, this becomes expensive. I've seen systems crash because they exhausted the connection limit, not because of query load.

Use PgBouncer or a similar connection pooler. Configure it in transaction pooling mode for maximum efficiency, keeping in mind that session state (SET commands, advisory locks, session-scoped prepared statements) doesn't carry across transactions in that mode. Set your application's pool size to match your actual concurrency needs, not some arbitrary large number.
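
Whichever pooler you use, keep an eye on how close the server is to its limit. A simple check, straight from the system views:

    -- Current connections versus the configured maximum
    SELECT count(*) AS open_connections,
           current_setting('max_connections') AS max_connections
    FROM pg_stat_activity;

    -- Where those connections are going; piles of 'idle in transaction' are a red flag
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;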

The Migration Engine

One of the most demanding PostgreSQL systems I built was a DICOM/PACS migration engine that processed 1.8 million patient studies and 600 million+ files. The database had hundreds of millions of records tracking the state of every file through the migration pipeline.

Key decisions that made this work (a rough SQL sketch follows the list):

  • Partitioning by migration batch. Each batch of studies was a separate partition. This kept individual partitions manageable and allowed us to drop completed partitions cleanly.
  • COPY instead of INSERT. For bulk data loading, COPY is orders of magnitude faster than individual INSERT statements.
  • Deferred indexing. Drop indexes before a bulk load, load the data, then recreate the indexes. The total time is dramatically less than maintaining the indexes during the load.
  • Read replicas for reporting. Production queries ran against the primary. Reporting and monitoring queries ran against a streaming replica. This kept write performance consistent.
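
Roughly what the first three decisions look like in SQL. The file_state table, its columns, and the batch number are illustrative; the real schema was larger and more involved:

    -- State-tracking table, partitioned by migration batch
    CREATE TABLE file_state (
        batch_id   int         NOT NULL,
        file_id    bigint      NOT NULL,
        status     text        NOT NULL,
        updated_at timestamptz NOT NULL DEFAULT now()
    ) PARTITION BY LIST (batch_id);

    CREATE TABLE file_state_batch_42 PARTITION OF file_state FOR VALUES IN (42);

    -- Bulk load: drop the index, COPY the data in, then rebuild
    -- (server-side COPY shown; use \copy from psql for a client-side file)
    DROP INDEX IF EXISTS idx_file_state_batch_status;
    COPY file_state FROM '/data/batch_42.csv' WITH (FORMAT csv, HEADER true);
    CREATE INDEX idx_file_state_batch_status ON file_state (batch_id, status);

    -- A completed batch can be dropped as a single partition
    DROP TABLE file_state_batch_42;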

Operational Wisdom

A few things I've learned the hard way:

Test your backups by restoring them. A backup you've never restored is a backup that might not work. Schedule regular restore tests.

Log slow queries. Set log_min_duration_statement to capture queries above a threshold. Review them weekly. This is how you find performance regressions before users notice.
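
Turning that on is a two-statement job; the 500 ms threshold is just an example to tune for your workload:

    -- Log every statement slower than 500 ms, then reload the configuration
    ALTER SYSTEM SET log_min_duration_statement = '500ms';
    SELECT pg_reload_conf();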

Upgrade regularly. Each major PostgreSQL release brings meaningful performance improvements. Don't run a version that's three releases behind because upgrading feels risky. The risk of not upgrading is worse.

PostgreSQL is an extraordinary database. But like any powerful tool, it rewards expertise and punishes neglect. Invest in understanding it deeply, and it will serve you well at any scale.