Interview demo · internal explainer

Closing Room

An AI reconciliation copilot — and everything you need to understand it, explain it, and defend it in an eBay “AI builder / internal financial tools” interview.

What this document covers

1 · The 60-second version
2 · Why this demo (your positioning)
3 · The problem: what reconciliation actually is
4 · The big idea (the one sentence to memorize)
5 · How it works — architecture, layer by layer
6 · The finance model, explained from scratch
7 · The break catalogue (every kind of problem)
8 · The five guardrails — and why each exists
9 · How the AI part really works
10 · How to run the demo live (talk track)
11 · Hard interview questions + strong answers
12 · What’s built vs. what’s left
13 · Glossary (finance + AI terms)

1 · The 60-second version

eBay pays millions of sellers. Someone in finance has to prove that what the books say a seller is owed equals what the payment processor settled equals what actually left the bank. That three-way check is called reconciliation. When the three don’t agree, that’s a break, and a human currently hunts it down by hand in spreadsheets.

Closing Room automates the hunt. A deterministic program (plain code, no AI) matches the millions of clean rows and isolates the handful that don’t reconcile. Only those go to an AI, which explains each one in plain English, cites the exact records it used, and drafts a correction — but never posts anything; a human approves. Crucially, the AI’s “confidence” and “is this grounded in real data” are not the AI’s own claims — they are measured by separate code. That is the whole pitch: in finance, you don’t trust an AI’s word; you verify it mechanically.

2 · Why this demo (your positioning)

You’re a designer who taught yourself to ship real autonomous AI systems. That’s rare and valuable — most “AI builders” can’t design, and most designers can’t build agents. Don’t apologize for “no formal training.” The field moves faster than any curriculum; what matters is that you can take a fuzzy problem, wire an LLM + tools + data into something reliable, and ship it. You’ve done exactly that repeatedly.

The one line to internalize: “I’m a builder who designs. I’ve spent the last year shipping autonomous AI systems end to end — I just did it on my own problems instead of a company’s.”

Closing Room is chosen specifically because it lets you say something a finance org deeply cares about: an AI that confidently invents a number is worse than no AI at all. The entire design answers that fear. That’s why it beats a generic “chat with your data” dashboard — it proves judgment about reliability, not just wiring.

3 · The problem: what reconciliation actually is

Imagine you sell on a marketplace. Over a two-week period you make sales, some get refunded, the marketplace charges fees and ad costs, and it withholds sales tax on your behalf. At the end of the period the marketplace owes you a single net payout. Three separate systems record this journey:

Ledger

The marketplace’s own books: “for this period, we owe seller X $Y net” (after fees, refunds, tax, reserves).

Processor

The payment company (Stripe/Adyen): “we settled $Z to that seller,” minus its own processing fee.

Bank payout

The actual money movement: “a payout of $W hit the bank on this date,” often bundling many settlements.

Reconciliation = proving Ledger = Processor = Bank for every seller, every period. At eBay scale this is enormous, repetitive, and unforgiving: a few cents wrong across millions of rows is both a control failure and, in aggregate, real money. The tedious part isn’t the millions that match; it’s finding and explaining the few that don’t.

4 · The big idea

Use deterministic logic for what must be auditable. Use the LLM only where judgment and language add value. And never let the model self-certify — measure its trustworthiness with separate code.

Breaking that down:

Deterministic core. The matching (does A equal B equal C?) is plain arithmetic. It’s reproducible and auditable — you can point at the exact rule. An auditor will never accept “the AI decided these matched.”
LLM at the edges. The LLM does the thing it’s genuinely good at: reading messy context and writing a clear, human explanation of why a specific break happened, plus proposing a fix.
Computed trust. The two numbers a skeptic will attack — “how confident are you?” and “did you make that up?” — are produced by verifier code, not by the model. This is the part most people get wrong.

5 · How it works — architecture, layer by layer

┌────────────────┐   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│ 1. Seed        │──▶│ 2. Matcher     │──▶│ 3. AI triage   │──▶│ 4. Validator   │──▶│ 5. UI          │
│ synthetic      │   │ deterministic  │   │ claude -p      │   │ + confidence   │   │ Next.js        │
│ JSON data      │   │ buckets        │   │ (offline)      │   │ (code, not AI) │   │                │
└────────────────┘   └────────────────┘   └────────────────┘   └────────────────┘   └────────────────┘
  ground-truth         matched /            explanation +        grounded? conf?      served from
  labels baked in      reconciling /        cited ids +          → scored + cached    a static cache
                       exception            proposed fix                              (no live AI call)

Seed — synthetic data generator. A seeded (reproducible) script writes three feeds of fake-but-realistic payout data, and deliberately plants a known set of problems, each labelled with the truth. That label lets us later score whether the matcher and AI got it right.

Matcher — the deterministic core. Pure functions, no AI. For each seller-period it computes “expected vs. actually paid” and sorts every case into one of three buckets (below). This is the auditable heart.

AI triage — offline. For only the exception bucket, it asks Claude (via the claude -p command line) to explain the break and propose a resolution. This runs while building, not during the demo.

Validator + confidence — code, not AI. Every AI answer is checked: do the cited IDs exist? do the cited amounts match the source? does the proposed journal entry balance? A confidence score is computed from these signals. Results are cached to a JSON file.

UI — Next.js. The website reads the data + the cached triage and renders it. It makes zero AI calls at runtime, so it can’t lag, rate-limit, or cost money mid-demo.
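As a rough illustration of step 1 above, a seeded generator that plants labelled problems might look like the sketch below. Everything in it is an assumption made for illustration: the `SeededCase` shape, the single planted break type, the 20% planting rate, and the choice of a simple linear congruential generator as the seeded random source. It is not the project's actual seed script.

```typescript
// Illustrative sketch only: field names, the planted break, and the
// generator constants are assumptions, not the project's real seed script.
type Truth = "clean" | "duplicate_settlement";

interface SeededCase {
  sellerId: string;
  netOwedCents: number; // ledger side, integer cents
  settledCents: number; // processor side, integer cents
  truth: Truth;         // ground-truth label baked in at generation time
}

// Linear congruential generator: the same seed always yields the same sequence.
// The multiplication stays below 2^53, so the arithmetic is exact.
function makeRandom(seed: number): () => number {
  let state = Math.abs(Math.floor(seed)) % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

function generateCases(seed: number, count: number): SeededCase[] {
  const rand = makeRandom(seed);
  const cases: SeededCase[] = [];
  for (let i = 0; i < count; i++) {
    const netOwedCents = 10000 + Math.floor(rand() * 90000);
    // Plant a known problem in roughly one case out of five (assumed rate).
    const planted = rand() < 0.2;
    cases.push({
      sellerId: `seller-${i + 1}`,
      netOwedCents,
      // A planted duplicate means the processor reports the amount twice.
      settledCents: planted ? netOwedCents * 2 : netOwedCents,
      truth: planted ? "duplicate_settlement" : "clean",
    });
  }
  return cases;
}
```

Because every case carries its own `truth` label, a later step can compare the matcher's bucket and the AI's root cause against it and report an accuracy figure.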

Why the three buckets matter (step 2): the finance-literate distinction that makes this credible is separating a reconciling item (a difference that is explained and needs no action, like a timing lag) from an exception (a difference that needs a fix or investigation). A tool that flags every difference as a problem “cries wolf” and gets ignored. Ours proves it knows the difference.
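The three-bucket split can be sketched as a pure function. This is a minimal sketch under stated assumptions: the `SellerPeriod` fields, the `settlementPending` flag, and the order of the rules are hypothetical, and the 2-cent tolerance is taken from the "off by 1–2¢" row of the break catalogue. A real matcher would carry more rules (reserve releases, for instance).

```typescript
// Hypothetical sketch of the bucket decision; not the project's real matcher.
type Bucket = "matched" | "reconciling" | "exception";

interface SellerPeriod {
  netOwedCents: number;       // ledger
  settledCents: number;       // processor
  paidOutCents: number;       // bank
  settlementPending: boolean; // assumed flag: settlement lands in the next window
}

interface Classification {
  bucket: Bucket;
  reason: string;
}

const TOLERANCE_CENTS = 2; // assumed rounding tolerance

function classify(p: SellerPeriod): Classification {
  const ledgerToProcessor = p.settledCents - p.netOwedCents;
  const processorToBank = p.paidOutCents - p.settledCents;

  if (ledgerToProcessor === 0 && processorToBank === 0) {
    return { bucket: "matched", reason: "all three records agree" };
  }
  // Explained difference: money is on its way, nothing to fix yet.
  if (p.settlementPending && p.paidOutCents === 0) {
    return { bucket: "reconciling", reason: "timing lag" };
  }
  // Explained difference: small enough to be currency rounding.
  if (
    Math.abs(ledgerToProcessor) <= TOLERANCE_CENTS &&
    Math.abs(processorToBank) <= TOLERANCE_CENTS
  ) {
    return { bucket: "reconciling", reason: "rounding within tolerance" };
  }
  return { bucket: "exception", reason: "unexplained difference" };
}
```

Only cases that land in `exception` would be handed to the AI triage step; the other two buckets never leave plain code.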

6 · The finance model, explained from scratch

You don’t need an accounting background — here are the only concepts you need, in order.

Money is stored as whole cents (“minor units”)

Never store money as 12.34 (a floating-point number): binary floating point can’t represent most decimal fractions exactly, so rounding errors creep in. We store 1234 (integer cents) and format to “$12.34” only for display. In a reconciliation tool, a one-cent drift is the exact bug you’re hunting, so you must be exact everywhere.
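A small sketch of the integer-cents rule, with a display formatter. The function name `formatCents` and the US-dollar format are assumptions for illustration; the project's own money helpers may look different.

```typescript
// Sketch: keep money as integer cents, convert to text only for display.
// formatCents is a hypothetical helper name.
function formatCents(cents: number): string {
  if (Math.floor(cents) !== cents) {
    throw new Error("money must be a whole number of cents");
  }
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const dollars = Math.floor(abs / 100);
  const remainder = abs % 100;
  const padded = remainder < 10 ? "0" + remainder : String(remainder);
  return `${sign}$${dollars}.${padded}`;
}

// The drift this avoids: decimal fractions are inexact in floating point.
const floatSumIsExact = 0.1 + 0.2 === 0.3; // false
const centSumIsExact = 10 + 20 === 30;     // true

const display = formatCents(1234);         // "$12.34"
```

The formatter rejects fractional input on purpose, so a stray float fails loudly instead of being rounded away.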

The payout identity (the equation being checked)

For a seller in a period, the ledger computes:

netOwed = grossSales − refunds − marketplaceFees − adFees − facilitatorTaxWithheld − reserveHeld + reserveReleased

Then reconciliation asks: does netOwed equal what the processor settled, which equals what the bank paid out? Two of those terms deserve explanation because they’re the classic sources of confusion an eBay person will look for:

Reserve (held / released)

Marketplaces hold back a slice of your money for a while as protection against future refunds/chargebacks, then release it later. So a payout can be larger than the period’s sales because an old reserve was released — surprising, but correct. That’s a reconciling item, not a break.

Marketplace-facilitator tax

eBay collects sales tax and remits it to the government on the seller’s behalf, so it’s withheld from the seller’s payout. If you forget this line, your “what the seller is owed” is wrong — and an eBay finance person would notice instantly.
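The payout identity above translates directly into code. In this sketch the field names mirror the terms of the formula, all amounts are integer cents, and the sample figures are invented to show the reserve-release effect.

```typescript
// Sketch of the payout identity; all amounts are integer cents.
interface LedgerPeriod {
  grossSales: number;
  refunds: number;
  marketplaceFees: number;
  adFees: number;
  facilitatorTaxWithheld: number;
  reserveHeld: number;
  reserveReleased: number;
}

function netOwed(p: LedgerPeriod): number {
  return (
    p.grossSales -
    p.refunds -
    p.marketplaceFees -
    p.adFees -
    p.facilitatorTaxWithheld -
    p.reserveHeld +
    p.reserveReleased
  );
}

// Invented example: $1,000.00 of sales plus a $200.00 reserve release.
const examplePeriod: LedgerPeriod = {
  grossSales: 100000,
  refunds: 5000,
  marketplaceFees: 2500,
  adFees: 1000,
  facilitatorTaxWithheld: 8000,
  reserveHeld: 0,
  reserveReleased: 20000,
};

const exampleNet = netOwed(examplePeriod); // 103500, i.e. $1,035.00
```

Here the payout ($1,035.00) exceeds the period's gross sales ($1,000.00) purely because an old reserve was released, which is the surprising-but-correct case described above.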

One-to-many is real

A single bank payout usually bundles many settlements. So matching isn’t always 1-row-to-1-row; the tool handles many settlements → one payout, and shows a variable number of source records rather than a rigid three-column layout.
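One way to picture the many-settlements-to-one-payout match is the sketch below. It assumes each settlement record carries the ID of the payout that bundled it (`payoutId`); that link, and the type names, are hypothetical, and a real feed may need to infer the grouping from dates and amounts instead.

```typescript
// Sketch: many settlements roll up into one bank payout.
// The payoutId link on each settlement is an assumption.
interface Settlement {
  id: string;
  payoutId: string;
  amountCents: number;
}

interface Payout {
  id: string;
  amountCents: number;
}

interface PayoutMatch {
  payoutId: string;
  settlementIds: string[]; // variable number of source records
  differenceCents: number; // bank amount minus sum of settlements
}

function matchPayout(payout: Payout, settlements: Settlement[]): PayoutMatch {
  const members = settlements.filter((s) => s.payoutId === payout.id);
  const settledTotal = members.reduce((sum, s) => sum + s.amountCents, 0);
  return {
    payoutId: payout.id,
    settlementIds: members.map((s) => s.id),
    differenceCents: payout.amountCents - settledTotal,
  };
}
```

A `differenceCents` of zero means the bundle reconciles; the list of `settlementIds` is what a UI would render as the variable-length set of source records.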

7 · The break catalogue

The generator plants these deliberately. Notice the last column (“Right action”): the same “difference” can demand very different actions, and that nuance is the credibility.

Item	Bucket	What it is	Right action
Sub-cent rounding	reconciling	Off by 1–2¢ from currency math	No action (within tolerance)
Reserve released	reconciling	Old held-back money released this period	No action (explained)
Timing lag	reconciling	Settlement lands next window, not yet paid	No action; escalate if it ages out
Fee mismatch	exception	Charged 2.9% vs. the 2.5% contract	Adjusting journal entry (correct the fee)
FX remeasurement	exception	Exchange rate moved → real gain/loss	Adjusting entry to an FX Gain/Loss account
Duplicate settlement	exception	Processor paid the same period twice	Adjusting entry (reverse the duplicate)
Missing settlement	exception	Ledger owes, processor never settled	Dispute case — NOT a journal entry
Chargeback pending	exception	Payout short by a late chargeback	Dispute case until resolved
Unknown / ambiguous	exception	No rule matches deterministically	Route to a human

Why “dispute case ≠ journal entry” is the sharpest point: a confirmed error (a wrong fee) can be corrected with an accounting entry. But an unconfirmed item (a payment that hasn’t arrived, a chargeback still in dispute) must not be booked as a correction yet — doing so recognizes money you don’t actually know is owed. A tool that proposes a journal entry for everything is naive; ours proposes the correct type of action per case.
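The rule that each exception type gets its own kind of action can be expressed as a lookup table. The mapping below restates the exception rows of the catalogue; the identifier names (`BreakKind`, `ActionType`, and the string values) are made up for this sketch.

```typescript
// Sketch: the catalogue's "Right action" column as a typed lookup.
// Identifier names are hypothetical; the mapping follows the table above.
type BreakKind =
  | "fee_mismatch"
  | "fx_remeasurement"
  | "duplicate_settlement"
  | "missing_settlement"
  | "chargeback_pending"
  | "unknown";

type ActionType = "adjusting_entry" | "dispute_case" | "human_review";

const ACTION_BY_KIND: Record<BreakKind, ActionType> = {
  // Confirmed errors: safe to correct with a balanced entry.
  fee_mismatch: "adjusting_entry",
  fx_remeasurement: "adjusting_entry",
  duplicate_settlement: "adjusting_entry",
  // Unconfirmed items: track them, but book nothing yet.
  missing_settlement: "dispute_case",
  chargeback_pending: "dispute_case",
  // No rule matched deterministically.
  unknown: "human_review",
};

function actionFor(kind: BreakKind): ActionType {
  return ACTION_BY_KIND[kind];
}
```

Typing the table as `Record<BreakKind, ActionType>` means the compiler complains if a new break kind is added without deciding its action.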

8 · The five guardrails — and why each exists

① Deterministic matching

The match/no-match decision is plain code, reproducible and inspectable. Why: auditors and controllers need to point at an exact rule. “The AI thought so” is not an acceptable basis for a financial control.

② Grounding validator (4 layers)

Every AI answer is mechanically checked before it’s shown: (1) every cited ID actually exists in the data; (2) every cited amount equals the real source value; (3) any proposed journal entry balances (debits = credits) and uses a valid account; (4) the correction’s size matches the measured difference. Why: this converts “the AI cites its sources” from a claim into a tested fact. If the model hallucinates an ID or an unbalanced entry, the validator catches it and the UI shows it in red. We literally plant one bad case to prove the catch works — in the live demo it shows Caught ✕ at confidence 0.35.
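The four layers can be sketched as one function that returns a list of failures (an empty list meaning "grounded"). The data shapes, account names, and the rule that total debits must equal the absolute measured difference are assumptions made for this sketch; the real validator may define layer 4 differently.

```typescript
// Hypothetical sketch of the four grounding layers; shapes are assumptions.
interface SourceRecord { id: string; amountCents: number }
interface Citation { id: string; amountCents: number }
interface EntryLine { account: string; debitCents: number; creditCents: number }

interface TriageAnswer {
  citations: Citation[];
  entry: EntryLine[] | null; // null when the proposal is a dispute case
}

function validateGrounding(
  answer: TriageAnswer,
  records: SourceRecord[],
  validAccounts: string[],
  measuredDifferenceCents: number
): string[] {
  const failures: string[] = [];

  for (const c of answer.citations) {
    const source = records.filter((r) => r.id === c.id)[0];
    if (source === undefined) {
      failures.push(`cited id not found: ${c.id}`);            // layer 1
    } else if (source.amountCents !== c.amountCents) {
      failures.push(`cited amount differs from source: ${c.id}`); // layer 2
    }
  }

  if (answer.entry !== null) {
    let debits = 0;
    let credits = 0;
    for (const line of answer.entry) {
      debits += line.debitCents;
      credits += line.creditCents;
      if (validAccounts.indexOf(line.account) === -1) {
        failures.push(`unknown account: ${line.account}`);      // layer 3
      }
    }
    if (debits !== credits) {
      failures.push("entry does not balance");                  // layer 3
    }
    if (debits !== Math.abs(measuredDifferenceCents)) {
      failures.push("entry size differs from measured difference"); // layer 4
    }
  }
  return failures;
}
```

A planted hallucination, such as a citation to an ID that is not in `records`, produces a non-empty failure list, which is the signal a UI can render in red.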

③ Computed confidence (not self-reported)

The confidence number is calculated from real signals — does the arithmetic reconcile exactly? did the AI’s root cause match what the deterministic matcher independently suspected? does the entry balance? is the case unambiguous? Why: if you ask an LLM “how confident are you?”, the number is poorly calibrated theatre. A number derived from verifiable properties is defensible. This is the single most impressive answer you can give when a sharp interviewer probes.
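To make "computed, not self-reported" concrete, here is a minimal sketch that turns the four signals named above into a score. The equal weighting is purely an assumption for illustration; the real scoring may weight the signals differently, and this sketch does not attempt to reproduce any specific number from the demo.

```typescript
// Sketch: confidence derived from verifiable signals, never from the model.
// Equal weights are an assumption; the real weighting may differ.
interface ConfidenceSignals {
  arithmeticReconciles: boolean;    // does the math tie out exactly?
  rootCauseMatchesMatcher: boolean; // AI's cause agrees with the matcher's suspicion
  entryBalances: boolean;           // debits equal credits
  caseUnambiguous: boolean;         // only one rule plausibly applies
}

function computeConfidence(s: ConfidenceSignals): number {
  const checks = [
    s.arithmeticReconciles,
    s.rootCauseMatchesMatcher,
    s.entryBalances,
    s.caseUnambiguous,
  ];
  const passed = checks.filter((ok) => ok).length;
  return passed / checks.length; // a value between 0 and 1
}
```

Since every input is a boolean produced by code, the same answer always gets the same score, and each point of the score can be traced to a specific check.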

④ Human-in-the-loop approval

Nothing posts automatically. Proposed corrections sit in an Approve/Reject queue with an audit trail. Why: it mirrors real financial controls (segregation of duties, SOX). The AI is an assistant, not an actor.

⑤ Cached, offline triage

The AI runs while building; the live site serves the cached, already-validated results. Why: two reasons. (a) The demo can never lag, rate-limit, or cost money while you’re screen-sharing; (b) the output is deterministic, so a request to “do it again” produces the same result every time. This is itself a signal of engineering judgment about demos and cost.

9 · How the AI part really works

People assume “AI feature” means “call an API at runtime.” Here it deliberately doesn’t. The flow:

A build-time script loops over each exception and calls Claude through the claude -p command (authenticated by your existing Claude plan, so $0 and no API key), asking for a strict JSON answer.

It runs Claude from an isolated temporary folder so it ignores your other Claude settings, with input redirected so it can’t hang. The answer is stripped of markdown, parsed with JSON.parse, and validated against a strict schema (zod). If it’s malformed, it retries up to 3×, then throws loudly; it never silently skips an exception.
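The strip, parse, validate, retry loop can be sketched as follows. Several things here are assumptions: the model call is abstracted as a synchronous `ask` function rather than an actual claude -p invocation, a hand-written check stands in for the zod schema so the block has no dependencies, the `Triage` shape is invented, and "up to 3×" is read as three attempts in total.

```typescript
// Sketch of the parse-and-retry loop. ask() stands in for the real model
// call, and parseTriage() stands in for a zod schema; both are assumptions.
interface Triage {
  explanation: string;
  citedIds: string[];
}

const FENCE = "\u0060\u0060\u0060"; // three backticks, written as escapes

// Remove a surrounding markdown code fence, if the model added one.
function stripMarkdown(raw: string): string {
  let text = raw.trim();
  if (text.indexOf(FENCE) === 0) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.length >= 3 && text.slice(text.length - 3) === FENCE) {
    text = text.slice(0, text.length - 3);
  }
  return text.trim();
}

// Strict shape check: returns null instead of a partially valid object.
function parseTriage(value: unknown): Triage | null {
  if (typeof value !== "object" || value === null) return null;
  const v = value as { explanation?: unknown; citedIds?: unknown };
  if (typeof v.explanation !== "string") return null;
  if (!Array.isArray(v.citedIds)) return null;
  const ids: string[] = [];
  for (const id of v.citedIds) {
    if (typeof id !== "string") return null;
    ids.push(id);
  }
  return { explanation: v.explanation, citedIds: ids };
}

function triageWithRetry(ask: () => string, maxAttempts: number = 3): Triage {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = stripMarkdown(ask());
    let data: unknown;
    try {
      data = JSON.parse(text);
    } catch {
      lastError = "answer was not valid JSON";
      continue;
    }
    const parsed = parseTriage(data);
    if (parsed !== null) return parsed;
    lastError = "answer did not match the schema";
  }
  // Fail loudly: a skipped exception must never go unnoticed.
  throw new Error(`triage failed after ${maxAttempts} attempts: ${lastError}`);
}
```

The function either returns a fully validated answer or throws; there is no code path that returns a default or drops the case.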

Each valid answer is scored by the validator + confidence code, then written to a cache file that gets committed to the repo. That file (with a provenance stamp: model, timestamp, input hash) is what the website reads.
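A provenance stamp also makes it cheap to decide whether a cached answer is still valid. The sketch below is a rough illustration under assumptions: the `CacheEntry` shape is invented, the input is assumed to be serialized to a stable string before hashing, and a tiny non-cryptographic hash stands in for whatever hash the project really uses (a production version would more likely use SHA-256).

```typescript
// Sketch: cache entries stamped with provenance, reused only while the
// input is unchanged. Shapes and the hash function are assumptions.
interface Provenance {
  model: string;       // which model produced the answer
  generatedAt: string; // ISO timestamp
  inputHash: string;   // fingerprint of the exact input sent to the model
}

interface CacheEntry<T> {
  provenance: Provenance;
  result: T;
}

// Small non-cryptographic hash (djb2 style), used here only as a stand-in.
// The intermediate product stays below 2^53, so the arithmetic is exact.
function hashInput(input: string): string {
  let h = 5381;
  for (let i = 0; i < input.length; i++) {
    h = (h * 33 + input.charCodeAt(i)) % 4294967296;
  }
  return h.toString(16);
}

function makeEntry<T>(model: string, generatedAt: string, input: string, result: T): CacheEntry<T> {
  return { provenance: { model, generatedAt, inputHash: hashInput(input) }, result };
}

// Re-run the model only when there is no entry or the input has changed.
function needsRetriage<T>(cached: CacheEntry<T> | undefined, currentInput: string): boolean {
  return cached === undefined || cached.provenance.inputHash !== hashInput(currentInput);
}
```

With this check, a rebuild touches the model only for exceptions whose inputs actually changed, and every cached answer stays traceable to the model and input that produced it.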

Real result on the current data: 6 of 7 exceptions grounded, with the one planted hallucination correctly caught. The 3 confirmed breaks got balanced adjusting entries; the unconfirmed ones got dispute cases.

10 · How to run the demo live (talk track)

Open closing-room.thomaspeng.ca. Say: “This reconciles marketplace seller payouts across three systems.” Point at the stat rail — match rate, count of exceptions, dollars at risk.

The top exception is already open. “A deterministic matcher found this; the AI explained it. Notice it cites the exact settlement IDs and amounts — and this ‘Grounded ✓’ isn’t the AI’s claim, it’s a separate validator confirming every figure traces to a real record.”

Point at the proposed entry: “It drafts a balanced correcting entry, but nothing posts — a human approves. For the unconfirmed breaks it opens a dispute case instead of a journal entry, because you don’t book money you can’t confirm.”

Go to the Eval view (once built): “Here’s a deliberately planted hallucination. The validator catches it — flagged red, low confidence. That’s the whole thesis: in finance you verify the AI mechanically, you don’t trust it.”

Backstop if they push on production reliability: “I’ve run this pattern with real stakes — a trading bot with hard risk caps and a self-healing monitor that caught and recovered from a real incident. Same principle: guardrails around an unreliable component.”

11 · Hard interview questions + strong answers

“How do you know the AI isn’t just making up numbers?”

“I don’t trust it — I verify. A separate validator checks that every cited ID exists, every amount matches the source, and every proposed entry balances. It runs as a test. I even plant a bad answer to prove the catch fires. In the demo that one shows caught, in red.”

“That confidence score — is it the model’s?”

“No. Model self-confidence is uncalibrated. Mine is computed from deterministic signals: does the arithmetic reconcile, did the model’s root cause agree with what the rule-based matcher independently suspected, does the entry balance, is the case unambiguous. It’s a property of the answer, not the model’s opinion of itself.”

“Why not just have the AI do the matching too?”

“Because matching has to be auditable and reproducible. An auditor won’t accept ‘the model decided these reconcile.’ So the deterministic core does matching; the LLM only explains and proposes, where language and judgment actually help.”

“How would this scale to eBay’s volume?”

“The matcher is the cheap part — it’s just arithmetic over rows, so it moves to a batch job over the data warehouse. The LLM only ever touches the exception tail, which is tiny relative to total volume, so cost stays bounded. I’d cache and only re-triage when an exception’s inputs change — the provenance hash already supports that.”

“What happens when the model returns garbage?”

“It retries with a strict schema, then throws loudly — it never silently drops an exception. Silent failure in a financial control is the dangerous outcome, so I made the pipeline fail visibly instead.”

“What are the limits / what would you do next?”

“It’s synthetic data and one payout flow. Next I’d wire real warehouse queries, add more break types, and add a feedback loop where an analyst’s Approve/Reject decisions tune which cases auto-clear. The architecture — deterministic core, verified LLM edge, human gate — stays the same.”

12 · What’s built vs. what’s left

Done & deployed

Money/types/chart-of-accounts foundation
Seeded synthetic data (payout periods, reserves, tax, 10 planted items)
Deterministic 3-way matcher (100% bucket accuracy vs. truth)
Computed confidence + 4-layer grounding validator
Offline claude -p triage, zod-gated, cached
Next.js UI, live at the URL, with the hero auto-expanded
21 passing tests; deployed via systemd + Caddy

Remaining

Eval panel — surface the caught hallucination as proof (high value)
README case-study — the artifact that survives a phone screen
Batch B — the “do it again with different data” moment
Approve/Reject wiring + audit log
Polish pass; optional variance tab + cinematic animation

13 · Glossary

reconciliation	Proving two or more records of the same money agree.
3-way match	Checking ledger = processor = bank for each item.
break	A case where the records don’t agree.
reconciling item	A difference that is explained and needs no action (e.g. timing).
exception	A difference that needs a fix or investigation.
settlement	The payment processor’s record of paying a seller.
payout	The actual bank money movement (may bundle settlements).
reserve	Money held back temporarily as risk protection, released later.
facilitator tax	Sales tax the marketplace collects & remits for the seller.
adjusting entry	An accounting correction with balanced debits and credits.
dispute case	A follow-up for an unconfirmed item (no booking yet).
chart of accounts	The fixed list of accounts entries can post to.
grounding	Checking the AI’s claims trace to real source data.
minor units	Money as whole cents (integers) to avoid rounding drift.
claude -p	Running Claude non-interactively from the command line.
zod	A library that validates data matches an expected shape.
SOX	Sarbanes–Oxley — the law behind strict financial controls.

Closing Room · internal explainer · generated 2026-07-02 · demo at closing-room.thomaspeng.ca · repo /home/tpeng/closing-room