Multi-step support agent for a logistics SaaS — Pro tier
Client: Acme Logistics (mid-market freight TMS, ~6k seats)
Built a chained support agent that classifies, retrieves, drafts, and hands off — wired to Zendesk, the customer's Postgres, and a hosted eval harness. Shipped on the Pro tier ($350/mo) in 8 working days. Now resolves 68% of tier-1 tickets end-to-end at $0.04 per resolution.
- Median cost per resolved ticket: $0.04
- Classification accuracy on prod eval set: 94%
- First production deploy: 8 days
- False-confident reply rate: 0.2%
Challenge
Acme's support team was 11 agents handling roughly 1,400 tickets/week. The CTO had spun up two internal LLM prototypes; neither cleared an eval bar (one hallucinated billing terms, the other answered 100% of tickets with a 14% wrong-answer rate). They had written off AI for support as 'two years out' when they hit our pricing page and asked what $350/mo actually shipped.
The real problem was unbounded scope: both prototypes tried to answer everything. A productized Pro-tier build forced a narrower scope — classify the ticket, retrieve from a known corpus, draft a reply with citations, and hand off (don't guess) when confidence is low.
Approach
Day 1-2: Pulled 90 days of closed Zendesk tickets and built a labeled eval set (1,200 examples across 6 intents). Wrote the eval harness before the agent — pass/fail on classification + citation grounding + a 'false-confident' check that flags any reply asserting a fact not in the retrieved context.
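The 'false-confident' check can be sketched in a few lines. This is an illustrative heuristic, not the production judge: it splits the draft into sentences and flags any sentence whose content words have too little overlap with the retrieved context (the real harness layers a judge model on top of this kind of lexical gate).

```python
# Minimal sketch of a false-confident check: flag draft sentences that
# assert content with no support in the retrieved context. The token-overlap
# heuristic and 0.5 threshold are assumptions for illustration.
import re

def sentence_grounded(sentence: str, context: str, threshold: float = 0.5) -> bool:
    """A sentence counts as grounded if enough of its content words appear in context."""
    words = {w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3}
    if not words:
        return True  # nothing factual to check
    ctx = context.lower()
    hits = sum(1 for w in words if w in ctx)
    return hits / len(words) >= threshold

def false_confident(draft: str, context: str) -> list[str]:
    """Return the draft sentences that assert content absent from context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    return [s for s in sentences if not sentence_grounded(s, context)]

context = "Refunds for duplicate freight charges post within 5 business days."
draft = ("Refunds for duplicate charges post within 5 business days. "
         "Your enterprise plan includes unlimited refunds.")
flagged = false_confident(draft, context)  # second sentence is unsupported
```

In the eval harness, any flagged sentence fails the whole example — that is what keeps the false-confident rate a hard gate rather than a soft score.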
Day 3-5: Built the agent as four chained steps — classify (Haiku), retrieve (pgvector over their KB + last 30d of resolved tickets), draft (Sonnet with citation requirement), confidence-gate (heuristic + judge model). Anything under threshold routes to a human with the draft hidden, because a missing draft is cheaper than a wrong one.
Day 6-8: Wired to Zendesk via webhook, deployed on our hosted endpoint, set up the cost + accuracy dashboard the Pro tier ships with. Monthly retainer covers prompt iteration, eval refresh, and adding new intents as the product surfaces shift.
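On the Zendesk side, the webhook receiver is small. Two caveats on this sketch: Zendesk signs webhook payloads with HMAC-SHA256 over timestamp + body (check the current docs for exact header names before relying on this), and the JSON field names below are whatever you define in the webhook's body template — `ticket_id`, `subject`, and `description` here are assumptions, not a canonical schema.

```python
# Sketch of the webhook receive path: verify the HMAC signature, then pull
# the fields the agent needs. Field names are assumed, not Zendesk-canonical.
import base64
import hashlib
import hmac
import json

def verify_signature(secret: str, timestamp: str, body: bytes, signature: str) -> bool:
    """HMAC-SHA256 over timestamp + raw body, base64-encoded, compared in
    constant time. Matches the signing scheme Zendesk documents for webhooks."""
    digest = hmac.new(secret.encode(), timestamp.encode() + body, hashlib.sha256).digest()
    return hmac.compare_digest(base64.b64encode(digest).decode(), signature)

def parse_ticket(body: bytes) -> dict:
    """Extract the ticket id and the text the classifier sees."""
    payload = json.loads(body)
    return {
        "ticket_id": int(payload["ticket_id"]),
        "text": (payload.get("subject", "") + "\n" + payload.get("description", "")).strip(),
    }

secret = "example-signing-secret"
ts = "1700000000"
body = b'{"ticket_id": "42", "subject": "Late shipment", "description": "Where is it?"}'
sig = base64.b64encode(hmac.new(secret.encode(), ts.encode() + body,
                                hashlib.sha256).digest()).decode()
ok = verify_signature(secret, ts, body, sig)
ticket = parse_ticket(body)
```

Rejecting unsigned or mis-signed payloads at the edge keeps the hosted endpoint from ever running the chain on spoofed tickets.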
Outcome
End-to-end resolution sits at 68% on tier-1 (target was 50%). Median cost per resolved ticket is $0.04 — well under the $0.12 budget the CFO had penciled in. Classification accuracy holds at 94% against a refreshed weekly eval set. False-confident reply rate is 0.2%, below the 0.5% gate. Human reviewers see 32% of tickets instead of 100%, and time-to-first-response on the routed-to-human cohort dropped from 2h 40m to 11 minutes because the draft + retrieved context arrives with the ticket. Retainer renewed at month 2; the team has since asked us to scope a Pro-tier build for their billing-disputes flow.
Stack
- Claude Sonnet + Haiku (chained)
- pgvector on Supabase
- Hosted eval harness (Promptfoo + custom judges)
- Zendesk webhook integration
- Cost + accuracy dashboard (Pro tier default)
Working on something similar?
A partner will respond personally within one business day. If there isn't a fit, we'll tell you so and point you somewhere better.