This was a week-long side project I hacked on with a friend. I work in defense, and for obvious reasons I don’t get to mess around with LLM infra as much as I’d like. So we did a little internal hackathon thing and I went off and built what I kept wishing I had when I’m doing research.
The basic idea is simple: there’s a ridiculous amount of public government and regulatory data that moves markets, and a lot of it barely gets read in an “investment research” way because the volume is insane. Most people are not going to sit there reading 200 FDA updates, SEC stuff, legislative output, clinical trials updates, policy consultations, comment letters, etc, every week.
I wanted tooling that can chew through that firehose and hand a PM something that looks like an actual memo.
Also I wanted an excuse to build a proper multi-agent harness instead of another “ask 5 agents and merge” toy.
Why I care about government paperwork
A lot of my best trades in recent years have come from obscure commodities and policy-driven weirdness, where the edge is literally “I bothered to read the boring documents.”
One example that stuck with me: the EU was banning a bunch of fertilizers. Inside that set there was a fertilizer used for sugar beets where, unlike the others, there wasn’t a clean alternative. I went down the rabbit hole, read way too many government docs, then started digging into the niche industry stuff and found farmers basically saying “we’re screwed without this.”
The market didn’t really move for a while. Then months later you start seeing the actual numbers show up and it suddenly becomes real to everyone at once.
That’s the whole pattern. The information exists, it’s public, but it’s buried in volume and nobody wants to read it. So I wanted a system that can do the first pass at scale and surface “this might matter.”
What the system does
Input is raw text. It might start life as a PDF or HTML, but by the time it hits the council it’s just text plus a timestamp (if we have one).
The output is a strict JSON memo that a PM can scan, sort, and then decide: “this is worth my time” vs “this is noise.”
This is the exact judge output shape it produces:
{
  "signal": "string",
  "affected_assets": ["string"],
  "causal_chain": "string",
  "market_consensus": "string",
  "mispricing_thesis": "string",
  "direction": "LONG",
  "confidence": 7,
  "time_horizon": "3-6 months",
  "key_risks": ["string"],
  "priority": "MEDIUM",
  "rationale": {
    "consensus_view": "string",
    "key_disagreements": ["string"],
    "realism_confidence": 7,
    "anchor_report_label": "Agent C",
    "strongest_evidence_for": "string",
    "strongest_evidence_against": "string",
    "counter_to_own_synthesis": "string"
  }
}

A PM workflow is basically:
- Ingest a bunch of documents (or paste one in manually while testing).
- The council produces a stack of memos.
- The UI lets you scroll them fast. Confidence, priority, affected assets, time horizon, and the causal chain do most of the filtering.
- If something looks real, it gets handed off for deeper research (price action, more context, validation, sizing, etc). I only built a thin version of that deeper lane, because the council itself was the interesting bit.
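To make the "scan, sort, filter" step concrete, here is a minimal triage sketch over memos in the schema above. The field names come straight from the judge output; the confidence threshold and the "drop LOW priority" rule are illustrative, not what the real UI does.

```python
# Sketch: schema-gate and triage memos before they reach the PM queue.
# Field names match the judge schema above; thresholds are illustrative.
REQUIRED = {"signal", "affected_assets", "direction", "confidence",
            "time_horizon", "priority", "rationale"}

def is_valid_memo(memo: dict) -> bool:
    """A memo only counts if it parses and carries the required fields."""
    return isinstance(memo, dict) and REQUIRED.issubset(memo)

def triage(memos: list[dict], min_confidence: int = 6) -> list[dict]:
    """Keep valid memos worth a PM's time, highest priority first."""
    order = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}
    kept = [m for m in memos
            if is_valid_memo(m)
            and m["confidence"] >= min_confidence
            and m["priority"] != "LOW"]
    return sorted(kept, key=lambda m: (order[m["priority"]], -m["confidence"]))
```

The point is that everything downstream of the judge is dumb dict manipulation, because the schema is strict.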
The council, in plain English
I built it as a pipeline where every stage is forced to output machine-readable data, with retries and artifacts so you can see exactly what happened when something goes wrong.
The cast:
- A router model that decides FULL_COUNCIL vs SINGLE_AGENT.
- A regime classifier that adds a bit of macro backdrop and extra context (short mention because it’s basically “LLM + search + a bit of price/volume context”).
- Five personas, each looking at the same input through a different lens.
- A judge model that synthesizes the final report.
The five personas:
- Agent A: fundamentals and capital loss risk, long-ish horizons.
- Agent B: flow, positioning, microstructure, shorter horizons.
- Agent C: macro and cross-asset regime (this is the persona used on the single-agent path).
- Agent D: catalysts and event sequencing, shorter horizons.
- Agent E: structural and secular trend lens, long-ish horizons.
Full Council Execution
Two-stage per agent: free-form reasoning first, then forced JSON. Self-consistency picks one candidate per agent. Judge sees anonymised labels.
What actually happens in the default setup:
- Router runs once.
- Regime runs once.
- Each persona runs 5 samples of reasoning (so 25 reasoning calls).
- Each of those samples gets run through a JSON formatter (25 formatting calls).
- Each persona does a self-consistency selection (5 selection runs).
- Judge runs once.
That’s 58 model calls for a clean full-council run. If formatting starts failing and you hit retries, call count can balloon fast.
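The arithmetic behind the 58 is worth writing down, because it's also the knob you turn when cost matters. Numbers match the default setup above:

```python
# Where the 58 calls in a clean full-council run come from.
personas, samples = 5, 5

router     = 1
regime     = 1
reasoning  = personas * samples   # free-form reasoning passes (25)
formatting = personas * samples   # strict-JSON formatting passes (25)
selection  = personas             # one self-consistency pick per persona
judge      = 1

total = router + regime + reasoning + formatting + selection + judge
print(total)  # 58
```

Drop samples from 5 to 3 and the run costs 38 calls instead, at the price of a weaker consistency signal.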
The thing that made this work in practice was a boring trick: every persona does two calls.
- Let the model think in normal text. Let it ramble, argue with itself, do the “human” part.
- Then run a separate formatting call that forces the output into strict JSON.
I tried the “make the model think and format perfectly in one shot” approach. It sucked. You either get worse reasoning because it’s terrified of the schema, or you get better reasoning and garbage JSON. Splitting it fixed that.
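The two-call pattern is simple enough to sketch. `call_model` here is a placeholder for whatever client you use, and the prompts are illustrative, but the shape is the real one: reason first with no schema pressure, then format as a narrow second task.

```python
import json

def run_persona(call_model, persona_prompt: str, document: str) -> dict:
    """Two-stage persona run: free-form reasoning, then strict JSON.

    `call_model(system=..., user=...)` is a stand-in for your LLM client.
    """
    # Call 1: let the model ramble. No schema in sight.
    reasoning = call_model(
        system=persona_prompt,
        user=f"Analyse this source and think out loud:\n\n{document}",
    )
    # Call 2: a pure formatting task over the finished reasoning.
    raw = call_model(
        system="Convert the analysis into the memo JSON schema. Output JSON only.",
        user=reasoning,
    )
    return json.loads(raw)  # raises if the contract is broken -> retry ladder
```

The formatter never has to think and the reasoner never has to count braces, which is the whole trick.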
Self-consistency is how I deal with randomness
LLMs are non-deterministic. Sometimes they’re brilliant, sometimes they hallucinate, sometimes they latch onto one stupid detail and won’t let go. If you’re doing anything finance-adjacent, you can’t just shrug and say “the model was weird.”
So each persona runs multiple samples and I pick the dominant cluster by:
- direction (LONG/SHORT)
- time horizon (bucketed)
- priority (HIGH/MEDIUM/LOW)
If one sample decides the FDA is secretly banning oxygen and another says “this is noise”, that’s a smell. The consistency selector usually makes the hallucination stand out as the odd one out.
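A minimal version of that selector, assuming each structured sample is a dict in the memo schema. Real horizon bucketing is fuzzier than this; treating the raw `time_horizon` string as the bucket is a simplification.

```python
from collections import Counter

def select_consistent(samples: list[dict]) -> dict:
    """Pick one sample from the dominant (direction, horizon, priority) cluster."""
    key = lambda s: (s["direction"], s["time_horizon"], s["priority"])
    winner, _count = Counter(key(s) for s in samples).most_common(1)[0]
    # Return the first sample that agrees with the majority cluster.
    return next(s for s in samples if key(s) == winner)
```

The lone "FDA bans oxygen" sample ends up in a cluster of one and never gets picked.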
Reliability stuff that actually mattered
The three failures that forced the architecture, in order:
1) Temporal context, because models will accuse your data of being fake
Early on I forgot to inject real temporal context. I fed it a government doc and the model basically said: “this is clearly fictional, the dates are nonsense, this reads like some futuristic Japan fairytale.”
Which was honestly funny, but also the kind of failure that would silently kill your whole pipeline if you didn’t notice.
So every run resolves an “effective source time” and injects a temporal context block into every model call, with explicit rules like “this can post-date your training data and still be real” and “interpret today/yesterday/tomorrow against the effective timestamp.”
That one fix removed a whole class of dumbness.
2) Strict schemas, because downstream systems need determinism
If you want a UI, filtering, sorting, comparisons across time, and the ability to audit later, you need a strict contract.
So everything is schema-gated. If it doesn’t parse, it doesn’t count. That sounds harsh, but it’s the difference between “cool demo” and “something a PM could look at without rolling their eyes.”
The JSON formatting stage has a retry ladder. If formatting fails repeatedly, it falls back to a deterministic “emergency” object that is schema-valid but intentionally low confidence, explicitly saying “formatting failed, manual review needed.”
Same idea for the judge. If the judge fails to produce valid output, there’s a deterministic judge fallback that synthesizes from whatever persona outputs are still parseable.
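The retry ladder plus emergency object looks roughly like this. `format_call` is a placeholder for the JSON-formatting model call, and the minimal field check and emergency shape here are simplified stand-ins for the full schema validation:

```python
import json

def format_with_fallback(format_call, reasoning: str, retries: int = 3) -> dict:
    """Try the formatter a few times, then emit a deterministic emergency memo."""
    for _ in range(retries):
        try:
            memo = json.loads(format_call(reasoning))
            if "direction" in memo and "confidence" in memo:
                return memo
        except (json.JSONDecodeError, TypeError):
            continue  # climb the ladder
    # Deterministic emergency object: schema-valid, loudly low confidence.
    return {
        "signal": "FORMATTING FAILED - manual review needed",
        "direction": "LONG",
        "confidence": 1,
        "priority": "LOW",
        "emergency_fallback": True,
    }
```

Downstream code never sees an unparseable blob; it sees either a real memo or an object that screams "don't trust me".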
3) Artifacts, because “the model was weird” is not a diagnosis
I store basically everything during dev: raw reasoning, structured candidates, selection metadata, errors, truncation markers, router decision, temporal context, judge input, judge output, whether emergency fallbacks fired, etc.
That does two things:
- When you’re debugging, you can see exactly where it degraded.
- If a PM is about to put on size, you can audit every step instead of trusting vibes.
This also made model testing way less annoying. You can swap models, change retry counts, change context windows, and actually see where things start breaking.
Single agent fast path
Most government updates are noise. Spending five personas and a judge on a nothingburger is a waste of money and time.
So there’s a single-agent mode. Router chooses it only when it’s extremely confident (threshold is high on purpose). And if the router errors, it defaults to full council.
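The routing rule fits in a few lines. The 0.9 threshold and the decision dict shape are illustrative; the invariant is the real one: any router failure means full council.

```python
def choose_mode(router_call, document: str, threshold: float = 0.9) -> str:
    """Single-agent only on very high router confidence; errors default to council."""
    try:
        decision = router_call(document)  # e.g. {"mode": "...", "confidence": 0.95}
        if decision["mode"] == "SINGLE_AGENT" and decision["confidence"] >= threshold:
            return "SINGLE_AGENT"
    except Exception:
        pass  # a broken router must never block a run
    return "FULL_COUNCIL"
```

Failing toward the expensive path is deliberate: the cheap path is an optimisation, not a safety net.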
Simple Signal Execution
In SINGLE_AGENT, I run the macro persona (Agent C) and still emit the exact same final schema. No judge call. The rationale text is framed as “single-agent path; no cross-agent consensus computed” so nobody mistakes it for consensus.
It’s deliberately boring. It’s supposed to be cheap and fast.
The part I actually learned building this
Building multi-agent systems is mostly about all the stuff people skip in demos.
- Models fail in predictable ways. Treat them like unreliable dependencies and design around it.
- Separate “thinking” from “formatting” if you care about both quality and determinism.
- If you don’t store artifacts, you don’t have an operational system. You have a slot machine with logs.
- Self-consistency isn’t magic, it’s just a practical way to reduce variance and catch hallucinations without pretending you solved alignment.
Tradeoffs
This is a side project. It’s opinionated and a bit brute force.
- Full council is expensive. That’s why routing exists.
- The service logic is concentrated and could be cleaner.
- Emergency fallbacks protect uptime but obviously reduce quality. That’s why the audit trail matters.
Still, the core thing works: you can feed it a government source, it produces a structured memo, and it’s built in a way where you can actually tell what happened when something breaks.
That’s the bar.