Multi-step support agent for a logistics SaaS — Pro tier
Client: Acme Logistics (mid-market freight TMS, ~6k seats)
Built a chained support agent that classifies, retrieves, drafts, and hands off — wired to Zendesk, the customer's Postgres, and a hosted eval harness. Shipped on the Pro tier ($350/mo) in 8 working days. Now resolves 68% of tier-1 tickets end-to-end at $0.04 per resolution.
- Median cost per resolved ticket: $0.04
- Classification accuracy on prod eval set: 94%
- First production deploy: 8 days
- False-confident reply rate: 0.2%
Challenge
Acme's support team was 11 agents handling roughly 1,400 tickets/week. The CTO had spun up two internal LLM prototypes; neither cleared an eval bar (one hallucinated billing terms, the other answered 100% of tickets with a 14% wrong-answer rate). They had written off AI for support as 'two years out' when they hit our pricing page and asked what $350/mo actually shipped.
The real problem was unbounded scope: both prototypes tried to answer everything. A productized Pro-tier build forced a narrower scope — classify the ticket, retrieve from a known corpus, draft a reply with citations, and hand off (don't guess) when confidence is low.
Approach
Day 1-2: Pulled 90 days of closed Zendesk tickets and built a labeled eval set (1,200 examples across 6 intents). Wrote the eval harness before the agent — pass/fail on classification + citation grounding + a 'false-confident' check that flags any reply asserting a fact not in the retrieved context.
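The 'false-confident' check can be sketched in a few lines. This is an illustrative heuristic, not the production judge: it splits the draft into sentences and flags any sentence whose content words have too little overlap with the retrieved context (the real harness layers a judge model on top of this kind of lexical gate).

```python
# Minimal sketch of a false-confident check: flag draft sentences that
# assert content with no support in the retrieved context. The token-overlap
# heuristic and 0.5 threshold are assumptions for illustration.
import re

def sentence_grounded(sentence: str, context: str, threshold: float = 0.5) -> bool:
    """A sentence counts as grounded if enough of its content words appear in context."""
    words = {w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3}
    if not words:
        return True  # nothing factual to check
    ctx = context.lower()
    hits = sum(1 for w in words if w in ctx)
    return hits / len(words) >= threshold

def false_confident(draft: str, context: str) -> list[str]:
    """Return the draft sentences that assert content absent from context."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    return [s for s in sentences if not sentence_grounded(s, context)]

context = "Refunds for duplicate freight charges post within 5 business days."
draft = ("Refunds for duplicate charges post within 5 business days. "
         "Your enterprise plan includes unlimited refunds.")
flagged = false_confident(draft, context)  # second sentence is unsupported
```

In the eval harness, any flagged sentence fails the whole example — that is what keeps the false-confident rate a hard gate rather than a soft score.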
Day 3-5: Built the agent as four chained steps — classify (Haiku), retrieve (pgvector over their KB + last 30d of resolved tickets), draft (Sonnet with citation requirement), confidence-gate (heuristic + judge model). Anything under threshold routes to a human with the draft hidden, because a missing draft is cheaper than a wrong one.
Day 6-8: Wired to Zendesk via webhook, deployed on our hosted endpoint, set up the cost + accuracy dashboard the Pro tier ships with. Monthly retainer covers prompt iteration, eval refresh, and adding new intents as the product surfaces shift.
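On the Zendesk side, the webhook receiver is small. Two caveats on this sketch: Zendesk signs webhook payloads with HMAC-SHA256 over timestamp + body (check the current docs for exact header names before relying on this), and the JSON field names below are whatever you define in the webhook's body template — `ticket_id`, `subject`, and `description` here are assumptions, not a canonical schema.

```python
# Sketch of the webhook receive path: verify the HMAC signature, then pull
# the fields the agent needs. Field names are assumed, not Zendesk-canonical.
import base64
import hashlib
import hmac
import json

def verify_signature(secret: str, timestamp: str, body: bytes, signature: str) -> bool:
    """HMAC-SHA256 over timestamp + raw body, base64-encoded, compared in
    constant time. Matches the signing scheme Zendesk documents for webhooks."""
    digest = hmac.new(secret.encode(), timestamp.encode() + body, hashlib.sha256).digest()
    return hmac.compare_digest(base64.b64encode(digest).decode(), signature)

def parse_ticket(body: bytes) -> dict:
    """Extract the ticket id and the text the classifier sees."""
    payload = json.loads(body)
    return {
        "ticket_id": int(payload["ticket_id"]),
        "text": (payload.get("subject", "") + "\n" + payload.get("description", "")).strip(),
    }

secret = "example-signing-secret"
ts = "1700000000"
body = b'{"ticket_id": "42", "subject": "Late shipment", "description": "Where is it?"}'
sig = base64.b64encode(hmac.new(secret.encode(), ts.encode() + body,
                                hashlib.sha256).digest()).decode()
ok = verify_signature(secret, ts, body, sig)
ticket = parse_ticket(body)
```

Rejecting unsigned or mis-signed payloads at the edge keeps the hosted endpoint from ever running the chain on spoofed tickets.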
Outcome
End-to-end resolution sits at 68% on tier-1 (target was 50%). Median cost per resolved ticket is $0.04 — well under the $0.12 budget the CFO had penciled in. Classification accuracy holds at 94% against a refreshed weekly eval set. False-confident reply rate is 0.2%, below the 0.5% gate. Human reviewers see 32% of tickets instead of 100%, and time-to-first-response on the routed-to-human cohort dropped from 2h 40m to 11 minutes because the draft + retrieved context arrives with the ticket. Retainer renewed at month 2; the team has since asked us to scope a Pro-tier build for their billing-disputes flow.
Stack
- Claude Sonnet + Haiku (chained)
- pgvector on Supabase
- Hosted eval harness (Promptfoo + custom judges)
- Zendesk webhook integration
- Cost + accuracy dashboard (Pro tier default)
Working on something similar?
A partner will respond personally within one business day. If there isn't a fit, we'll tell you so and point you somewhere better.