Job Description

Minimum qualifications:

Bachelor's degree in a technical field, or equivalent practical experience.
5 years of experience in program management.
Experience evaluating Large Language Models (LLMs), working on Natural Language Processing (NLP) data pipelines, or designing data for Reinforcement Learning from Human Feedback (RLHF).
Experience with construct operationalization, design, or behavioral coding manual development.

Preferred qualifications:

PhD degree in Quantitative Psychology, Psychometrics, Educational Measurement, Behavioral Data Science, or a related discipline.
5 years of experience managing cross-functional or cross-team projects.
Experience calculating and interpreting Inter-Rater Reliability (IRR) metrics (e.g., Cohen’s/Fleiss’ Kappa, ICC) and conducting statistical variance analysis on human-generated data.
Experience in Item Response Theory (IRT), Bayesian modeling, or hierarchical/multilevel modeling, specifically applied to rater behavior or differential item functioning.
Experience in computational linguistics, semantics, or designing evaluations for highly nuanced language tasks.
Experience using statistical programming languages (R or Python) to analyze large, complex datasets.

About the job

A problem isn’t truly solved until it’s solved for all. That’s why Googlers build products that help create opportunities for everyone, whether down the street or across the globe. As a Technical Program Manager at Google, you’ll use your technical expertise to lead complex, multi-disciplinary projects from start to finish. You’ll work with stakeholders to plan requirements, identify risks, manage project schedules, and communicate clearly with cross-functional partners across the company. You’re as comfortable explaining your team’s analyses and recommendations to executives as you are discussing the technical tradeoffs in product development with engineers.

As the TPM of Human Measurement and Validation, you will be the chief architect of the human evaluation systems that power our Reinforcement Learning (RL) models in Search. Approaching AI evaluation as a massive-scale psychometric challenge, you will translate complex, latent constructs, such as model helpfulness, safety, and reasoning, into highly reliable, standardized behavioral assessments. Bridging the gap between ML research and the science of human measurement, you will design coding manuals, scale automated quality assurance pipelines, and lead statistical calibration. Your work will ensure our global RLHF data maximizes construct validity, minimizes measurement error, and delivers the clinical-grade accuracy required to train safe, capable AI.

In Google Search, we're reimagining what it means to search for information – any way and anywhere. To do that, we need to solve complex engineering challenges and expand our infrastructure, while maintaining a universally accessible and useful experience that people around the world rely on. In joining the Search team, you'll have an opportunity to make an impact on billions of people globally.

The US base salary range for this full-time position is $163,000-$237,000 + bonus + equity + benefits. Our salary ranges are determined by role, level, and location. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific salary range for your preferred location during the hiring process.

Please note that the compensation details listed in US role postings reflect the base salary only, and do not include bonus, equity, or benefits. Learn more about benefits at Google.

Responsibilities

Deconstruct complex semantic and behavioral model outputs into observable, quantifiable rating criteria, functioning much like diagnostic behavioral anchors.
Establish baseline Inter-Rater Reliability (IRR) metrics (e.g., ICC, Cohen’s/Fleiss’ Kappa) and architect programmatic pipelines to monitor the longitudinal psychometric health of data collections.
Partner with ML Engineering to integrate human assessment workflows directly into the model development lifecycle and deploy automated evaluation tooling.
Design automated data quality checks to detect careless responding, systematic rater bias, straight-lining, and other threats to data integrity at scale.
Utilize advanced statistical frameworks (drawing on Classical Test Theory or Item Response Theory) to detect rater drift, identify differential item functioning, and implement systemic interventions.

About Google

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.

Check out our career opportunities at goo.gle/3DLEokh

Industry

IT & Software

Company Size

10,000+ employees

Headquarters

Mountain View, CA

Website

google.com
