Recruiting from Scratch

Research Engineer – Benchmarking, Evals & Failure Analysis

Recruiting from Scratch  •  $130k - $500k/yr  •  San Francisco, CA (Onsite)  •  1 month ago

Job Description

Location: San Francisco
Company Stage: Late-Stage / Series C (AI / Applied ML)
Office Type: Onsite (5 Days a Week)
Salary: $130,000 – $400,000 + Equity

This fast-growing AI company is operating at the forefront of applied machine learning and labor transformation. By partnering with leading AI labs and enterprises, they are building systems that combine human expertise with cutting-edge AI to improve model performance and unlock new categories of work. With strong revenue, scale, and backing from top-tier investors, they are shaping how frontier models are trained, evaluated, and deployed in real-world environments.

What You Will Do

  • Design and implement benchmarking systems to evaluate model capabilities such as tool use, reasoning, and agent behavior
  • Build and operate end-to-end evaluation pipelines, including scoring systems, dashboards, and reporting infrastructure
  • Conduct systematic failure analysis on model outputs, identifying key failure modes and translating them into actionable improvements
  • Develop rubrics, evaluators, and scoring frameworks that balance rigor with scalability (human + automated evaluation)
  • Partner with research and applied AI teams to align evaluation systems with training and product goals
  • Analyze data quality and performance trends to inform model training, data generation, and post-training strategies
  • Own evaluation and benchmarking systems in a fast-paced, high-iteration environment

Ideal Background

  • Strong applied AI or ML engineering experience, particularly in model evaluation, benchmarking, or failure analysis
  • Hands-on experience building or running LLM evaluation systems, benchmarks, or experimentation pipelines
  • Strong coding ability (Python or similar) with experience building production-quality systems
  • Solid understanding of data structures, algorithms, and backend systems
  • Experience working with APIs, databases (SQL/NoSQL), and cloud infrastructure
  • Ability to reason deeply about model behavior, experimental results, and system performance
  • Comfortable operating in ambiguous, high-ownership environments with rapid iteration cycles

Preferred

  • Experience working on post-training, RL, or evaluation teams at AI labs or AI-first companies
  • Familiarity with LLM evaluation techniques, benchmarking frameworks, or agent evaluation systems
  • Experience with synthetic data generation, rubric design, or reward modeling workflows
  • Publications or research experience in ML, especially in evaluation or benchmarking
  • Exposure to large-scale experimentation systems or model performance tracking infrastructure

Compensation and Benefits

  • Competitive base salary ($130K – $500K) + meaningful equity
  • Relocation and housing support available
  • Monthly meal stipend and premium wellness perks (e.g., fitness membership)
  • Comprehensive health insurance
  • Opportunity to work directly with frontier AI labs and influence model development at the cutting edge

This is a high-impact role at the intersection of engineering and applied AI research, ideal for candidates excited about defining how next-generation models are evaluated, improved, and deployed at scale.

About Recruiting from Scratch

Recruiting from Scratch provides recruiting services for companies that need to hire the best talent in software engineering, hardware engineering, product design, product management, marketing, GTM, and accounting & finance.

Industry: HR & Recruiting
Company Size: 51-200 employees
Headquarters: New York, NY
Year Founded: 2021