Tearing out the vote: an architecture post-mortem
We’ve written about the principle: the engine produces clues, never verdicts. This is the story of how we learned it — by building the wrong thing first, shipping it, and then spending real days tearing it back out. Post-mortems usually cover outages. This one covers an idea that worked exactly as designed, which was the problem.
The design that felt obvious. Three of our detection families were live — a digit-distribution test, a composite misstatement score, a distress measure — and the question arrived on schedule: so which companies do we look at? The obvious answer: let the detectors vote. Two out of three flags, mark the company. We shipped a tidy boolean at the end of the pipeline, and for a week it felt like progress. The list it produced even looked reasonable. Plausibility is the most dangerous property a wrong design can have.
The first crack: thresholds from nowhere. To vote, a continuous measurement must first become a yes/no — which means a cutoff. Ours came from the literature’s conventions and a Tuesday’s judgment. But a misstatement score’s meaning shifts with sector and size; a distress measure means one thing for a software company and another for a leveraged retailer. A single global threshold doesn’t resolve that context — it deletes it, invisibly, before anyone downstream can object.
The second crack: votes assume equals. Two-of-three treats every detector’s yes as the same size. Yet one detector might be strong evidence precisely where another is weak; a flag on thin data is not a flag on rich data; and silence — a detector finding nothing — carries information a vote count simply discards. We kept proposing patches: weight the votes, add per-sector thresholds, special-case the financials sector. Every patch made the boolean smarter and the system worse, because every patch buried more judgment deeper inside a layer that would never see an outcome to learn from.
The realization, stated plainly: the ensemble wasn’t aggregating evidence — it was destroying evidence, converting rich continuous measurements into a single bit at the earliest possible moment, using assumptions nobody could later inspect, tune, or unlearn. The vote wasn’t a judgment layer. It was judgment fossilized.
The teardown. Removing a column is easy; removing a verdict is not, because verdicts metastasize — downstream consumers had started keying on the boolean. So the removal became its own discipline, and it’s the same one we still use: mark the column deprecated, migrate every consumer to the continuous values underneath, verify nothing reads it, and only then drop it. The order matters. We’ve since retired other columns the same way, and the first teardown is why the procedure exists.
What stands in its place is the separation we now treat as constitutional: detectors emit measurements with context, a distinct downstream layer holds all judgment, and that layer is built to learn from what actually happens. The vote had answered “who do we look at?” with arithmetic. The honest answer is: that’s a decision, and decisions belong where the consequences can reach them.
We keep the old design written down, not out of nostalgia, but because the failure mode is a default. Every new detector we add, somebody — sometimes the person writing this — feels the pull to ship “just a simple flag” at the end of the pipe. The post-mortem is how we remember that we already tried that, it already looked reasonable, and it was already wrong.