Job Description
Research Engineer – Benchmarking, Evals & Failure Analysis
Location: San Francisco
Company Stage: Late-Stage / Series C (AI / Applied ML)
Office Type: Onsite (5 Days a Week)
Salary: $130,000 – $400,000 + Equity
This fast-growing AI company is operating at the forefront of applied machine learning and labor transformation. By partnering with leading AI labs and enterprises, they are building systems that combine human expertise with cutting-edge AI to improve model performance and unlock new categories of work. With strong revenue, scale, and backing from top-tier investors, they are shaping how frontier models are trained, evaluated, and deployed in real-world environments.
What You Will Do
- Design and implement benchmarking systems to evaluate model capabilities such as tool use, reasoning, and agent behavior
- Build and operate end-to-end evaluation pipelines, including scoring systems, dashboards, and reporting infrastructure
- Conduct systematic failure analysis on model outputs, identifying key failure modes and translating them into actionable improvements
- Develop rubrics, evaluators, and scoring frameworks that balance rigor with scalability (human + automated evaluation)
- Partner with research and applied AI teams to align evaluation systems with training and product goals
- Analyze data quality and performance trends to inform model training, data generation, and post-training strategies
- Own evaluation and benchmarking systems in a fast-paced, high-iteration environment
Ideal Background
- Strong applied AI or ML engineering experience, particularly in model evaluation, benchmarking, or failure analysis
- Hands-on experience building or running LLM evaluation systems, benchmarks, or experimentation pipelines
- Strong coding ability (Python or similar) with experience building production-quality systems
- Solid understanding of data structures, algorithms, and backend systems
- Experience working with APIs, databases (SQL/NoSQL), and cloud infrastructure
- Ability to reason deeply about model behavior, experimental results, and system performance
- Comfortable operating in ambiguous, high-ownership environments with rapid iteration cycles
Preferred
- Experience working on post-training, RL, or evaluation teams at AI labs or AI-first companies
- Familiarity with LLM evaluation techniques, benchmarking frameworks, or agent evaluation systems
- Experience with synthetic data generation, rubric design, or reward modeling workflows
- Publications or research experience in ML, especially in evaluation or benchmarking
- Exposure to large-scale experimentation systems or model performance tracking infrastructure
Compensation and Benefits
- Competitive base salary (within the range listed above) + meaningful equity
- Relocation and housing support available
- Monthly meal stipend and premium wellness perks (e.g., fitness membership)
- Comprehensive health insurance
- Opportunity to work directly with frontier AI labs and influence model development at the cutting edge
This is a high-impact role at the intersection of engineering and applied AI research, ideal for candidates excited about defining how next-generation models are evaluated, improved, and deployed at scale.