Job Description
Senior ICs who build ARIP's 15 named agents (A-11..A-25) end-to-end on LangGraph / CrewAI: prompt design, tool definitions, multi-step workflows, eval harnesses (golden sets, regression gates, LLM-as-judge, multi-step replay), HITL gate integration, Trust Gate progression, and per-agent cost optimisation. This role is distinct from DEAA's Senior AI Engineer, who owns the LLM Gateway: ARIP AI Engineers are platform consumers and agent builders.
Remote candidates outside of Thailand are welcome to apply.
Key Responsibilities
- Build agents on Layer 4 runtime end-to-end — each ships with eval harness, HITL gate config, observability instrumentation, per-agent cost meter, and runbook.
- Design and own golden-set test cases per agent; build regression gates in CI (no agent ships without eval-pass); implement multi-step conversation replay and LLM-as-judge patterns.
- Configure per-agent HITL gates and collect gate-progression evidence (Shadow 60d → Recommender 90d → Executor); co-own Trust Gate Framework for Suite 3 financial-threshold ladder (G0–G4).
- Tune model routing per agent (LLM provider / model tier): balance cost, latency, quality; implement semantic caching where appropriate.
- Consume DEAA's LLM Gateway via standard SDK; provide per-agent cost data to DEAA's GenAI Cost Dashboard; partner with DEAA Senior AI Engineer on embedding model selection and retrieval relevance.
- Author agent-engineering playbook alongside DEAA's AI Best Practices Playbook; mentor PACE-seeded engineers on agent engineering discipline.
Requirements
- 5+ years software engineering; 2+ years shipping LLM-based / agentic systems to production (not just RAG demos or notebooks).
- Expert in production multi-agent orchestration (LangGraph / CrewAI / AutoGen / DSPy or equivalent), with HITL gates by default rather than autonomous-by-default.
- Eval-driven LLM development in production: golden sets, LLM-as-judge, multi-step replay, regression gates in CI.
- HITL gate and agent guardrail design: has designed and tested prompt-injection, PII, and output-filtering defences in production.
- Strong Python (async, observability, testing); major LLM provider (Azure OpenAI / Anthropic / Bedrock / Vertex) production experience; Langfuse or equivalent for LLM tracing.
- Calibre: Senior AI Engineers from agentic-AI startups (Anthropic-adjacent ecosystem), Agoda, LINE MAN Wongnai, Grab, or SCBX, with multi-agent production experience.