The 33x lesson: bulk loading at filing scale

[ /engineering ] · 2026-06-13 · 3 min read

Some lessons you read about and nod. Others cost you a night. This one cost us a night, and it’s the reason a one-line rule now sits in the documents that govern how we build: any bulk ingestion above ten thousand rows goes through PostgreSQL’s COPY. No exceptions, no “just this once.”

Here’s the night. We were bootstrapping our SEC filings index — the metadata spine for every analysis downstream: millions of filings across roughly 7,900 active filers. The script was honest, well-tested, and row-by-row: prepare an INSERT, bind parameters, execute, next. It had worked fine in every smoke test, because smoke tests are small. At full scale, the arithmetic turned on us. Each round trip carries fixed costs — network latency, parse, plan, WAL flush — that don’t amortize when every row pays them alone. We watched the ETA climb into days and did what you do at that point: stopped pretending, killed the job, and rewrote.

COPY FROM STDIN changes the shape of the problem. One statement, one stream; rows flow as tab-separated text and the fixed costs get paid once per batch instead of once per row. In our environment the throughput gap was roughly 30x — the difference between an afternoon and a lost week. The number will vary with your hardware and your WAL settings; the shape of the curve will not.

The rewrite itself held the real lessons, because COPY is faster and stricter:

Escaping is now your job. SQL parameter binding quietly protected you from tabs, newlines, and backslashes living inside your data. A TSV stream protects you from nothing — one unescaped newline inside a filing title and every subsequent row is shifted into the wrong columns, silently. We wrote a small escaping function, then wrote tests for the escaping function, and that ratio of caution-to-code felt exactly right.

Failure semantics flip. Row-by-row, a bad record fails alone; you log it and move on. In a COPY batch, one malformed row can fail the stream. So the pipeline became: validate into a clean staging stream first, COPY the stream, then reconcile counts — rows offered, rows loaded, rows quarantined — before anything downstream gets to read.

Load into a side table, then publish. We COPY into a staging table and finish with a single transactional INSERT ... SELECT into the table of record. The expensive part happens off to the side; the visible part is one atomic move. If anything looks wrong at reconciliation, nothing downstream ever saw it.

The deeper lesson isn’t about Postgres. It’s that scale changes which costs dominate, and code that’s correct at a thousand rows can be wrong — operationally, economically wrong — at four million. We could have read that in a textbook. We did read it in a textbook. It became real the night the ETA said “Thursday.”

Now it’s a law in our engineering documents, which is where lessons go so you only pay for them once.