AutoAgent: Self-Optimizing Finance AI Case Study
AutoAgent is a self-optimizing eval loop that sits on top of an existing AI agent and continuously improves the agent’s prompt contract — its SKILL.md — without any human intervention.
The core idea is borrowed from software testing: instead of writing a new test suite for every model change, you maintain a golden set of verified query/answer pairs, run the live agent against them on every iteration, score the outputs with a dual (rule + LLM) scorer, and let a meta-LLM propose targeted edits to the prompt contract when scores are low. If an edit improves the average score it is committed; if not, it is discarded. The loop runs until a plateau is hit or a target threshold is reached.
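while not plateau:
    # propose a targeted edit from the cases that scored low
    candidate = meta_llm.propose_edit(skill_md, failing_cases)
    responses = agent.run(golden_queries, candidate)  # live calls
    scores = dual_scorer(responses, goldens)          # hard + LLM
    if avg(scores) > best:
        skill_md, best = candidate, avg(scores)       # keep the edit
        commit()
        snapshot()
    else:
        discard()                                     # revert to the last kept SKILL.md
        no_improvement += 1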
This makes prompt engineering measurable, reversible, and automated, properties that become essential when the agent has to answer questions across multiple live data platforms simultaneously.
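The keep-or-discard loop described above can be made concrete with a small, self-contained sketch. It is an illustration under stated assumptions, not the production implementation: `hill_climb`, `evaluate`, `propose_edit`, `target`, `patience`, and `pass_mark` are hypothetical names, and the toy evaluator and editor stand in for the live agent and the meta-LLM.

```python
from statistics import mean
from typing import Callable, List, Tuple


def hill_climb(
    skill_md: str,
    evaluate: Callable[[str], List[float]],        # assumed: runs the golden set, returns per-case scores
    propose_edit: Callable[[str, List[int]], str],  # assumed: stands in for the meta-LLM
    target: float = 0.95,   # hypothetical target threshold
    patience: int = 3,      # hypothetical plateau rule: stop after this many rejected edits in a row
    pass_mark: float = 0.5,  # hypothetical cutoff for calling a case "failing"
) -> Tuple[str, float, List[str]]:
    """Keep an edit only if it raises the mean golden-set score."""
    scores = evaluate(skill_md)
    best = mean(scores)
    snapshots = [skill_md]  # every kept version, oldest first
    no_improvement = 0
    while best < target and no_improvement < patience:
        failing = [i for i, s in enumerate(scores) if s < pass_mark]
        candidate = propose_edit(skill_md, failing)
        cand_scores = evaluate(candidate)
        if mean(cand_scores) > best:
            # commit: the candidate becomes the new baseline
            skill_md, scores, best = candidate, cand_scores, mean(cand_scores)
            snapshots.append(skill_md)
            no_improvement = 0
        else:
            # discard: skill_md is left untouched
            no_improvement += 1
    return skill_md, best, snapshots


# Toy stand-ins so the sketch runs without any live calls.
RULES = ["cite the source table", "state the fiscal quarter", "report currency units"]


def toy_evaluate(skill_md: str) -> List[float]:
    # A case "passes" when the contract mentions the rule it depends on.
    return [1.0 if rule in skill_md else 0.0 for rule in RULES]


def toy_propose_edit(skill_md: str, failing: List[int]) -> str:
    # Append the rule needed by the first failing case, if there is one.
    return skill_md + "\n- " + RULES[failing[0]] if failing else skill_md
```

Calling `hill_climb("# SKILL", toy_evaluate, toy_propose_edit)` walks the toy contract up one rule at a time and stops once the target is reached. In a real run, `evaluate` would be the expensive step, since each call replays the whole golden set against the live agent.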
The Multi-Data-Platform Context
Modern enterprise finance intelligence is not a single-source problem. "How did operating expenses trend against headcount this quarter?" touches at least three systems. Each platform has its own schema quirks, authentication flow, and output format expectations. The agent stitches them together via the Skills pattern common in modern agentic systems.
Solution — AutoAgent Eval Loop
AutoAgent wraps any individual skill’s live agent in an automated hill-climbing loop:
- Golden Set — 20 hand-curated query/answer pairs per skill, covering the domain’s key report types.
- Dual Scorer — every live response is scored on two axes (regex extraction + LLM semantic check).
- Meta-LLM Edit — Bedrock Sonnet receives the current SKILL.md and failing cases to propose an edit.
- Keep-or-Discard — if the new average exceeds the best so far, it is committed.
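One way the dual scorer could be structured is sketched below. The source only states that the two axes are regex extraction and an LLM semantic check, so everything else here is an assumption: the helper names (`extract_numbers`, `hard_score`, `dual_score`), the choice to compare extracted numbers against expected figures, the equal weighting, and the `judge` callable that stands in for the LLM call.

```python
import math
import re
from typing import Callable, List

# Matches figures such as 1,250,000 or 4.2 or -17.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> List[float]:
    """Pull every numeric figure out of a response, ignoring thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER.findall(text)]


def hard_score(response: str, expected: List[float], rel_tol: float = 1e-6) -> float:
    """Rule axis: fraction of expected figures that appear in the response."""
    if not expected:
        return 1.0  # assumption: nothing to check counts as a pass
    found = extract_numbers(response)
    hits = sum(
        any(math.isclose(f, e, rel_tol=rel_tol) for f in found) for e in expected
    )
    return hits / len(expected)


def dual_score(
    response: str,
    golden_answer: str,
    expected_numbers: List[float],
    judge: Callable[[str, str], float],  # assumed: returns a semantic score in [0, 1]
    hard_weight: float = 0.5,            # assumed: equal weighting of the two axes
) -> float:
    """Combine the rule axis and the LLM axis into one score in [0, 1]."""
    hard = hard_score(response, expected_numbers)
    soft = judge(response, golden_answer)
    return hard_weight * hard + (1.0 - hard_weight) * soft
```

Passing the judge in as a callable keeps the rule axis testable offline; in a live setup the callable would wrap the model call and parse its verdict into a number.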
Lessons Learned
- The dual scorer is fast and cheap: Haiku judge calls cost less than $0.001 per scored case.
- A golden set of only 20 cases was enough to detect regime changes in agent output.
- Snapshotting every kept SKILL.md version makes regression rollback trivial.
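As a rough illustration of the snapshot-and-rollback lesson, the sketch below writes each kept SKILL.md to a numbered file and restores any earlier version on demand. The file naming scheme (`SKILL.v0001.md`), the directory layout, and the function names are hypothetical; the source does not say how snapshots are stored.

```python
from pathlib import Path


def snapshot(skill_md: str, snapshot_dir: Path) -> Path:
    """Store a kept SKILL.md version under the next free version number."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    version = len(list(snapshot_dir.glob("SKILL.v*.md"))) + 1
    path = snapshot_dir / f"SKILL.v{version:04d}.md"
    path.write_text(skill_md, encoding="utf-8")
    return path


def rollback(snapshot_dir: Path, version: int, live_path: Path) -> str:
    """Overwrite the live SKILL.md with the contents of an earlier snapshot."""
    text = (snapshot_dir / f"SKILL.v{version:04d}.md").read_text(encoding="utf-8")
    live_path.write_text(text, encoding="utf-8")
    return text
```

Because snapshots are never modified after they are written, rolling back is a plain file copy and does not require replaying the loop.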