The Signal Machine
Can a PM build the intelligence infrastructure an entire org runs on?
Every product organization has the same problem, and almost none of them have solved it. Customer signal arrives continuously — through support channels, through customer success notes, through win/loss calls, through in-app surveys, through the feedback tool your customers use when something frustrates them enough to write it down. It accumulates in different systems, owned by different teams, read by whoever happens to be paying attention that week. The patterns that exist across multiple sources, across multiple customer segments, across multiple quarters — the kind that only emerge when you put everything together — mostly go undetected. I built the system that finds them.
How the problem actually works
A product organization spread across multiple areas doesn’t have a unified view of customer signal. Each PM reads what flows to them. A PM owning workflows sees workflow feedback. A PM owning reporting sees reporting feedback. Neither of them sees that the same underlying problem — say, difficulty building cross-functional visibility — is showing up in their area and five others simultaneously, through different surfaces, in different language, from different customer segments.
That’s not a failure of the individual PMs. It’s what happens when signal is distributed and synthesis is manual. The problems that affect a single area get noticed. The problems that live between areas, or that manifest differently across areas, don’t.
What I wanted to build was a system that reads everything, finds the patterns, and delivers the findings in a form each PM can immediately evaluate. The AI does the aggregation and the synthesis. The PM does the judgment — whether a spec is worth pursuing, whether it’s already been killed before, whether the evidence is strong enough.
The heavy lifting moves to the system. The work that requires PM judgment stays with the PM.
The architecture, as it emerged
The system didn’t arrive fully formed. The architecture I ended up with is the result of a series of design decisions that each made sense only after the previous layer was clear.
Four decisions that shaped the system
1. Parallel collection — Extractors run simultaneously, each querying one source. Latency is bounded by the slowest source, not the sum.
2. HDBSCAN clustering — Density-based, no fixed cluster count. Produces noise points by design — promoting noise to specs was the $19k mistake.
3. Three-tier librarian routing — Kill registry, merge, or new. Every cluster passes through this gate before entering the pipeline.
4. Hard cost cap — Not a config value, an architectural constraint. Every LLM call checks accumulated spend. Pipeline halts when hit.
Collection. Five agents run in parallel, each querying a different source. NPS survey data. Customer success activity notes. In-app feedback. Win/loss signals. Internal product feedback from employees using the product daily. Each collector extracts structured signals using a fast, cheap language model — batching ten records at a time, generating semantic embeddings for each signal, writing everything to a PostgreSQL database with vector extension. When all five finish, the next stage starts.
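In code, the collection pattern looks roughly like this. It's a sketch, not the production implementation: the source names and the stubbed collector are illustrative, and the real extractor does the LLM batching, embedding, and database writes inside the collector itself.

```python
import asyncio

# Illustrative source identifiers; the real collectors and their
# connection details are internal.
SOURCES = ["nps", "cs_notes", "in_app", "win_loss", "internal"]

async def collect_source(source: str) -> list[dict]:
    """Query one source and return structured signals. Stubbed here; the
    real version batches ten records per LLM call, generates embeddings,
    and writes to Postgres."""
    await asyncio.sleep(0)  # stands in for the actual I/O
    return []

async def collect_all() -> list[dict]:
    # All five collectors run concurrently, so wall-clock time is bounded
    # by the slowest source, not the sum of all five.
    batches = await asyncio.gather(*(collect_source(s) for s in SOURCES))
    return [signal for batch in batches for signal in batch]

signals = asyncio.run(collect_all())
```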
The extraction step is where signal diversity matters most. The same customer problem expressed in NPS language sounds different from the same problem expressed in a CSM note, which sounds different from how it shows up in a win/loss debrief. The model extracts the signal type — feature request, pain point, churn reason, competitor mention, escalation — and the customer context. The synthesis stage can then find that all three are the same problem.
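What one extracted signal looks like, roughly. The field names are my shorthand here rather than the actual schema, but the signal types are the ones the model extracts.

```python
from dataclasses import dataclass
from enum import Enum

class SignalType(str, Enum):
    FEATURE_REQUEST = "feature_request"
    PAIN_POINT = "pain_point"
    CHURN_REASON = "churn_reason"
    COMPETITOR_MENTION = "competitor_mention"
    ESCALATION = "escalation"

@dataclass
class ExtractedSignal:
    source: str              # e.g. "nps", "cs_notes", "win_loss"
    signal_type: SignalType
    summary: str             # one-sentence restatement of the problem
    customer_context: str    # segment, account stage, and similar context
    embedding: list[float]   # semantic embedding stored in Postgres
```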
Synthesis. This is where organizational intelligence actually happens.
HDBSCAN — a density-based clustering algorithm — runs on the full embedding matrix. It groups signals that are semantically similar, regardless of which source they came from or which customer segment they represent. A group of 20 signals from different sources, different customers, different time periods, all pointing at the same underlying problem becomes one cluster.
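The clustering step itself is small. A sketch, with a random matrix standing in for the real embedding table:

```python
import hdbscan
import numpy as np

# Stand-in for the embedding matrix: one row per extracted signal.
embeddings = np.random.rand(10_000, 768)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)  # the value settled on after the first run (see below)
labels = clusterer.fit_predict(embeddings)

# HDBSCAN labels points it cannot place in any dense group as -1: noise.
# Everything else is a group of semantically similar signals, whatever
# source or customer segment each one came from.
clusters: dict[int, list[int]] = {}
for idx, label in enumerate(labels):
    if label != -1:
        clusters.setdefault(int(label), []).append(idx)
```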
After clustering, a librarian agent routes each cluster through three tiers. First: has this been explicitly killed before? The kill registry holds every spec a PM has reviewed and decided not to pursue, along with the reasoning. New clusters that match an existing kill entry get suppressed unless the new evidence is materially different — a judgment call the model makes, one cluster at a time. Second: does this reinforce something already in progress? A cluster that matches an existing spec in the tracking system merges into it, strengthening the evidence base without creating a duplicate. Third: genuinely new. The cluster becomes a new spec and enters the pipeline.
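The routing logic, reduced to its shape. In the sketch below the match is a similarity threshold; in the real system the "materially different evidence" call is made by the model, one cluster at a time, and the registry structures are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    id: int
    centroid: list[float]
    signal_count: int

def similarity(a: list[float], b: list[float]) -> float:
    # Placeholder for cosine similarity over stored embeddings.
    return sum(x * y for x, y in zip(a, b))

def route_cluster(cluster: Cluster,
                  kill_registry: dict[int, list[float]],
                  open_specs: dict[int, list[float]],
                  threshold: float = 0.85):
    # Tier 1: a PM already reviewed this pattern and decided not to pursue it.
    for spec_id, centroid in kill_registry.items():
        if similarity(cluster.centroid, centroid) >= threshold:
            return ("suppressed", spec_id)
    # Tier 2: reinforces a spec already in progress; merge the evidence.
    for spec_id, centroid in open_specs.items():
        if similarity(cluster.centroid, centroid) >= threshold:
            return ("merged", spec_id)
    # Tier 3: genuinely new. It becomes a spec and enters the pipeline.
    return ("new_spec", None)
```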
Only the top clusters by signal count proceed downstream. The rest are stored — they may become relevant later as more signal accumulates.
The downstream pipeline. Each spec that enters the pipeline moves through six agents sequentially.
Problem decomposition builds the JTBD profile — what outcome the customer is trying to achieve, what's preventing them, how important it is, and how poorly it's currently served. This stage uses the full company strategy context to anchor the analysis in the current direction.
The debate stage is structurally the most interesting. An advocate agent builds the strongest possible case for pursuing this opportunity. A skeptic agent builds the strongest possible case against. A judge — running on a more capable model, the only stage in the pipeline that does — evaluates both sets of arguments and produces a RICE score and priority rating. The judge’s job is exactly the kind of nuanced prioritization call where reasoning depth matters: weighing competing evidence, assessing confidence, accounting for strategic alignment against implementation cost.
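The judge's output reduces to a standard RICE computation plus the written rationale. The field ranges below are the conventional RICE scales, not necessarily the exact ones the judge uses.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    # Illustrative fields; the real judge returns these alongside a
    # written rationale after weighing the advocate's and skeptic's cases.
    reach: float        # customers affected per quarter
    impact: float       # 0.25 minimal .. 3 massive (standard RICE scale)
    confidence: float   # 0..1, discounts thin or one-sided evidence
    effort: float       # person-months

    @property
    def rice(self) -> float:
        return (self.reach * self.impact * self.confidence) / self.effort

verdict = JudgeVerdict(reach=120, impact=2, confidence=0.8, effort=4)
print(round(verdict.rice, 1))  # 48.0
```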
Solution research, org review, and spec writing follow. The org review stage collapses what would have been eight separate calls — sales, marketing, engineering, finance, legal, customer success, support, professional services — into one structured call returning all eight perspectives simultaneously. The spec writer uses a fast, cheap model to fill a structured document template with the outputs of all prior stages.
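The shape of that single structured call, with the schema and the stubbed response as assumptions:

```python
from dataclasses import dataclass

FUNCTIONS = ["sales", "marketing", "engineering", "finance", "legal",
             "customer_success", "support", "professional_services"]

@dataclass
class Perspective:
    function: str
    risks: list[str]
    opportunities: list[str]

def org_review(spec_text: str) -> dict[str, Perspective]:
    """One structured LLM call (stubbed here) returning all eight
    perspectives at once, instead of eight separate calls."""
    response = {f: {"risks": [], "opportunities": []} for f in FUNCTIONS}  # stub
    return {f: Perspective(f, r["risks"], r["opportunities"])
            for f, r in response.items()}
```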
The router classifies the spec to the right PM area based on strategic pillar and initiative, moves the Google Doc to the correct folder in shared Drive, creates the tracking row, and sends a Slack notification to the assigned PM.
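The classification half of routing is a lookup once the spec carries a strategic pillar. The pillar and area names here are placeholders, and the Drive move, tracking row, and Slack notification are left out.

```python
# Placeholder pillar-to-area mapping; the real names are internal.
PILLAR_TO_AREA = {
    "collaboration": "workflows",
    "insights": "reporting",
}

def route_to_pm_area(strategic_pillar: str) -> str:
    # Unrecognized pillars fall back to a catch-all queue (illustrative).
    return PILLAR_TO_AREA.get(strategic_pillar.lower(), "triage")
```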
That’s the full chain. Signal in. Spec out. Weekly.
Building it as a PM, with Claude Code
Every component was designed before it was built. The extraction pipeline — how signals get batched, how the failure rate triggers an abort, what happens when the model returns a signal type that doesn’t exist in the schema — was written as a design document first. The database schema, the deduplication logic, the session management patterns — all of it went through a spec before Claude Code wrote the implementation.
That discipline wasn’t optional. When I skipped it, the code worked but made different tradeoffs than I intended. The most consequential example: the debate agent was loading every clustered signal in the entire database to build customer context, instead of loading only the signals for the spec it was evaluating. It passed every check. The customer ARR figures it would have produced for each spec would have spanned the entire customer base rather than the specific cohort that raised the problem. Found by reading the code carefully during a systematic architecture review, not by a test.
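The difference was one query's scope. A reconstruction, with assumed table and column names:

```python
# What the debate agent was doing: pulling every clustered signal in the
# database, so the customer context (and any ARR math built on it)
# spanned the entire customer base.
ALL_CLUSTERED_SIGNALS = """
    SELECT * FROM signals WHERE cluster_id IS NOT NULL;
"""

# What it should have been doing: only the signals behind the spec it was
# evaluating.
SIGNALS_FOR_SPEC = """
    SELECT s.*
    FROM signals s
    JOIN spec_signals ss ON ss.signal_id = s.id
    WHERE ss.spec_id = %(spec_id)s;
"""
```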
The infrastructure is AWS throughout — Step Functions orchestrating the stages, ECS Fargate running the agents on ARM64, Aurora PostgreSQL Serverless for the data layer, SQS between every stage. The entire stack is deployed via Terraform (the production repo is internal). A PM wrote and manages all of it.
What I saw in the run logs
The first full test run collected over 10,000 signals from two sources. HDBSCAN ran with a minimum cluster size of three signals and produced 8,292 clusters.
332 of those were real — groups of signals representing genuine, repeated customer problems. The other 7,960 were noise points that the algorithm couldn’t assign to any cluster. A single line of code I had written to avoid losing signal promoted every noise point to its own singleton cluster. Every singleton entered the downstream pipeline as a spec.
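The line itself was innocuous. A reconstruction, with assumed variable names:

```python
import numpy as np

labels = np.array([0, 0, 0, -1, 1, -1, -1])   # -1 is HDBSCAN noise

# "Don't lose any signal": give every noise point its own cluster id.
next_id = labels.max() + 1
for i in range(len(labels)):
    if labels[i] == -1:
        labels[i] = next_id   # a singleton cluster, soon to be a spec
        next_id += 1

# At scale: ~7,960 noise points promoted alongside 332 real clusters,
# 8,292 specs headed into the downstream pipeline.
```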
I looked at the estimated cost from the run logs before the downstream agents had processed more than a few dozen specs. At the current rate, across 8,292 specs, the run would cost approximately $19,000.
I stopped the execution.
The singleton promotion was the wrong product decision dressed as a technical one. A single isolated signal is not a product opportunity. A product opportunity is a pattern — multiple independent customers, different contexts, the same underlying problem. Promoting noise to specs doesn’t surface more signal. It buries the real signal in thousands of documents no one will read.
Raising the minimum cluster size to fifteen signals, dropping all noise points, and capping the total clusters that proceed to the downstream pipeline reduced 8,292 clusters to roughly eighty to one hundred per run. That’s the right number. It’s a number a PM organization can actually process.
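The selection that replaced singleton promotion fits in a few lines. Here `labels` is the clusterer's output with the minimum cluster size already set to fifteen; the cap constant is illustrative.

```python
from collections import Counter

MAX_CLUSTERS_PER_RUN = 100   # roughly eighty to one hundred survive in practice

def select_clusters(labels: list[int]) -> list[int]:
    # Noise (-1) is dropped outright, never promoted to a cluster.
    sizes = Counter(label for label in labels if label != -1)
    # Only the largest clusters by signal count proceed downstream;
    # the rest stay stored for future runs.
    return [label for label, _ in sizes.most_common(MAX_CLUSTERS_PER_RUN)]
```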
The cost cap followed as a hard architectural constraint. Every language model call now checks accumulated estimated spend against a configurable limit. When the pipeline hits the cap, it stops the Step Functions execution and halts the running agent. Test runs cap at $20. The cap isn’t a safety net — it’s a product decision about what a run should cost.
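A minimal sketch of the gate, assuming per-call cost estimates are available; the class and method names are mine, not the production code.

```python
class CostCapExceeded(RuntimeError):
    pass

class CostGate:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, estimated_call_cost_usd: float) -> None:
        """Called before every model request with the estimated cost."""
        if self.spent_usd + estimated_call_cost_usd > self.cap_usd:
            # In the real system this halts the running agent and stops
            # the Step Functions execution.
            raise CostCapExceeded(
                f"cap ${self.cap_usd:.2f} would be exceeded "
                f"(spent ${self.spent_usd:.2f})")
        self.spent_usd += estimated_call_cost_usd

gate = CostGate(cap_usd=20.0)   # test runs cap at $20
gate.charge(0.35)
```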
First run versus every run after
The system makes a hard distinction between a first run and every run after it.
The first run is a historical backfill. Depending on how far back you pull, you’re processing months of signal across all sources. It’s expensive by design — you’re seeding the intelligence layer. The derived tables (clusters, specs, kill registry) get wiped and rebuilt from the current batch. The signals themselves are preserved — they’re the raw material. Everything derived from them is regenerated.
Weekly runs are different in kind. They collect seven days of new signal, cluster only that, and run the librarian’s three-tier routing against everything that’s already in the system. A new cluster that closely matches an existing spec in review gets merged in — more evidence for a problem someone is already looking at. A cluster matching the kill registry gets suppressed. Genuinely new patterns create new specs.
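The difference between the two modes is mostly a matter of scope, sketched here with assumed table and column names.

```python
# Weekly runs cluster only the last seven days of signal; the first-run
# backfill wipes and rebuilds the derived tables from the full history.
WEEKLY_WINDOW_SQL = """
    SELECT id, embedding
    FROM signals
    WHERE created_at >= now() - interval '7 days';
"""

FIRST_RUN_RESET_SQL = """
    TRUNCATE clusters, specs, kill_registry;  -- raw signals are preserved
"""
```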
The system improves with use. The kill registry grows with PM decisions. The existing spec pool accumulates history. The routing gets better as the database fills with curated judgment calls. This is what makes it infrastructure rather than a tool — it compounds.
What this is actually for
The specs this system produces are starting points, not finished work. A PM who receives a spec in their queue this Monday gets a structured document with a JTBD profile, an opportunity score, a debate summary with RICE components, an org review across eight functions, and a solution proposal. They didn’t ask for it. They didn’t know the pattern existed until the system surfaced it.
Their job is judgment — is this real, is it worth pursuing, does it fit the current direction. The AI handled the aggregation, the synthesis, the framing. The PM handles the decision.
That’s the organizational intelligence layer that didn’t exist before. Not a tool any individual PM uses. A system the entire organization benefits from without changing how they work. The patterns that live between product areas, that no single PM would have seen from their vantage point, surface automatically.
The system doesn’t decide what to build. It decides what’s worth looking at.
None of this replaced any PM’s judgment. The PM who receives a spec on Monday still does the real work — evaluating the evidence, pressure-testing the framing, deciding whether the opportunity is real and whether now is the right time. What changed is what lands on their desk. Not the loudest request from the last customer call. The patterns the data was pointing to all along, distilled, ranked, and waiting for a decision.
The cost cap story is worth dwelling on. A $19,000 potential run wasn’t a system failure — it was the system behaving exactly as designed, just with parameters that hadn’t been thought through. The recognition came from reading run logs, not from an alert. The fix was a product decision: what constitutes a signal cluster worth pursuing, and what’s noise. Raising the minimum cluster size from three to fifteen wasn’t an optimization. It was a definition. A product opportunity requires evidence from multiple independent customers across time. One signal isn’t a pattern. Fifteen probably are.
Building that constraint into the architecture as a hard cap — not just a configuration value — is the kind of decision that separates infrastructure from prototypes. Prototypes handle the happy path. Infrastructure handles what happens when the numbers are wrong.