Back to Home
Case Study

AutoAgent: Self-Optimizing Finance AI Case Study

7 min read

AutoAgent is a self-optimizing eval loop that sits on top of an existing AI agent and continuously improves the agent’s prompt contract — its SKILL.md — without any human intervention.

The core idea is borrowed from software testing: instead of writing a new test suite for every model change, you maintain a golden set of verified query/answer pairs, run the live agent against them on every iteration, score the outputs with a dual (rule + LLM) scorer, and let a meta-LLM propose targeted edits to the prompt contract when scores are low. If an edit improves the average score it is committed; if not, it is discarded. The loop runs until a plateau is hit or a target threshold is reached.

while not plateau:
    responses = agent.run(golden_queries) # live calls
    scores = dual_scorer(responses, goldens) # hard + LLM
    
    if avg(scores) > best:
        skill_md = meta_llm.propose_edit(skill_md, failing_cases)
        commit()
        snapshot()
    else:
        discard()
        no_improvement += 1

This makes prompt engineering measurable, reversible, and automated — a property that becomes essential when the agent has to answer questions across multiple live data platforms simultaneously.

The Multi-Data-Platform Context

Modern enterprise finance intelligence is not a single-source problem. "How did operating expenses trend against headcount this quarter?" touches at least three systems. Each platform has its own schema quirks, authentication flow, and output format expectations. The agent from modern agentic systems stitches them together via a Skills pattern.

Solution — AutoAgent Eval Loop

AutoAgent wraps any individual skill’s live agent in an automated hill-climbing loop:

  1. Golden Set — 20 hand-curated query/answer pairs per skill, covering the domain’s key report types.
  2. Dual Scorer — every live response is scored on two axes (regex extraction + LLM semantic check).
  3. Meta-LLM Edit — Bedrock Sonnet receives the current SKILL.md and failing cases to propose an edit.
  4. Keep-or-Discard — if the new average exceeds the best so far, it is committed.

Lessons Learned

  • The dual-scorer is fast and cheap. Haiku judge calls cost < $0.001 per scored case.
  • A golden set of only 20 cases was enough to detect regime changes in agent output.
  • Snapshotting every kept SKILL.md version makes regression rollback trivial.