Job Description

Research Engineer – Benchmarking, Evals & Failure Analysis

Location: San Francisco
Company Stage: Late-Stage / Series C (AI / Applied ML)
Office Type: Onsite (5 Days a Week)
Salary: $130,000 – $400,000 + Equity

This fast-growing AI company is operating at the forefront of applied machine learning and labor transformation. By partnering with leading AI labs and enterprises, they are building systems that combine human expertise with cutting-edge AI to improve model performance and unlock new categories of work. With strong revenue, scale, and backing from top-tier investors, they are shaping how frontier models are trained, evaluated, and deployed in real-world environments.

What You Will Do

Design and implement benchmarking systems to evaluate model capabilities such as tool use, reasoning, and agent behavior
Build and operate end-to-end evaluation pipelines, including scoring systems, dashboards, and reporting infrastructure
Conduct systematic failure analysis on model outputs, identifying key failure modes and translating them into actionable improvements
Develop rubrics, evaluators, and scoring frameworks that balance rigor with scalability (human + automated evaluation)
Partner with research and applied AI teams to align evaluation systems with training and product goals
Analyze data quality and performance trends to inform model training, data generation, and post-training strategies
Own evaluation and benchmarking systems in a fast-paced, high-iteration environment

Ideal Background

Strong applied AI or ML engineering experience, particularly in model evaluation, benchmarking, or failure analysis
Hands-on experience building or running LLM evaluation systems, benchmarks, or experimentation pipelines
Strong coding ability (Python or similar) with experience building production-quality systems
Solid understanding of data structures, algorithms, and backend systems
Experience working with APIs, databases (SQL/NoSQL), and cloud infrastructure
Ability to reason deeply about model behavior, experimental results, and system performance
Comfortable operating in ambiguous, high-ownership environments with rapid iteration cycles

Preferred

Experience working on post-training, RL, or evaluation teams at AI labs or AI-first companies
Familiarity with LLM evaluation techniques, benchmarking frameworks, or agent evaluation systems
Experience with synthetic data generation, rubric design, or reward modeling workflows
Publications or research experience in ML, especially in evaluation or benchmarking
Exposure to large-scale experimentation systems or model performance tracking infrastructure

Compensation and Benefits

Competitive base salary ($130K – $500K) + meaningful equity
Relocation and housing support available
Monthly meal stipend and premium wellness perks (e.g., fitness membership)
Comprehensive health insurance
Opportunity to work directly with frontier AI labs and influence model development at the cutting edge

This is a high-impact role at the intersection of engineering and applied AI research, ideal for candidates excited about defining how next-generation models are evaluated, improved, and deployed at scale.

About Recruiting from Scratch

Recruiting from Scratch provides recruiting services for companies that need to hire the best talent in software engineering, hardware engineering, product design, product management, marketing, GTM, and accounting & finance.

Industry

HR & Recruiting

Company Size

51-200 employees

Headquarters

New York, NY

Year Founded

2021

Website

recruitingfromscratch.com
