kaiko.ai

Senior Evaluation ML Engineer

kaiko.ai  •  Zürich, CH (Onsite)  •  3 months ago

Job Description

About kaiko.ai

kaiko.ai is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.

Healthcare decisions are rarely made by a single person or from a single data source. kaiko’s assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.

Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.

We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.

About the role

kaiko’s Multimodal Large Language Model (MLLM) is trained on domain-specific, high-complexity medical data. To reach clinical-grade performance, we need comprehensive, large-scale evaluation that is clinically grounded.

As a Senior Evaluation ML Engineer, you’ll design and own our end-to-end evaluation stack, from gold-standard ground truths and synthetic benchmark generation to automated release-gating, with a focus on oncology-relevant tasks and metrics. You will partner with clinicians, external annotators, and ML researchers to ensure that every signal we measure reflects real clinical decision-making and informs our model development efforts.

As a Senior Evaluation ML Engineer you will

  • Build and operate our eval infrastructure at scale (Python + Ray/Spark, Dagster preferred) with strong CI/CD, reproducibility, and observability principles in mind.

  • Source & curate benchmarks (public, licensed, partner-provided) and generate high-fidelity synthetic cases with controls for clinical plausibility, leakage, cohort balance, and difficulty.

  • Define clinically meaningful task taxonomies and rubrics spanning text (clinical notes, reports), imaging (CT/MRI/PET), pathology (whole-slide images), genomics (VCF, biomarkers), and structured EHR/FHIR data.

  • Automate offline evaluations and build online evaluation flows (clinician-in-the-loop review, preference/ranking, A/B).

  • Collaborate with clinicians and external partners to facilitate expert evaluations, design annotation protocols, and translate clinical questions into measurable tasks.

  • Maintain benchmark hygiene: deduplication, de-identification awareness, leakage audits, stratified sampling, etc.

You will be based in Zurich or Amsterdam, with the expectation of spending ~50% of your time in the office.


About you

  • Excellent Python skills and strong software engineering fundamentals (testing, modular design, CI/CD).

  • Deep experience designing & operating evaluation or data-quality pipelines for ML/LLMs at scale.

  • Comfortable with distributed compute (Ray, Spark), data lakehouse paradigms (Delta/Iceberg) and columnar formats (Parquet/ORC).

  • Working knowledge of oncology workflows and terminology: staging (TNM), common biomarkers, lines of therapy, response criteria (e.g., RECIST), typical labs and imaging follow-up.

Nice To Have

  • Experience with eval frameworks (lm-eval-harness, OpenAI Evals, HF Evaluate) and preference modeling.

  • Background in biomed/healthtech (bioinformatics, medical imaging, clinical decision support, translational research, real-world evidence) or graduate work in a related field.

  • Safety/red-teaming for LLMs; familiarity with quality/risk practices for clinical software (e.g., MDR/SaMD concepts).

  • Experience reading and operationalizing radiology, pathology, and molecular reports for evaluation tasks.

  • Hands-on experience with workflow orchestration (Dagster preferred) and monitoring/observability.

  • Experience working with medical foundation models and evaluating them on benchmarks in radiology, pathology, and/or genomics.

  • Familiarity with medical standards/ontologies: FHIR/HL7, SNOMED CT, ICD-10/ICD-O, LOINC, DICOM, VCF.

We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box.

Why kaiko

At kaiko, we believe the best ideas come from collaboration, ownership, and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:

  • Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.

  • Collaboration: You’ll approach disagreement with curiosity, build on common ground, and create solutions together.

  • Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.


In addition, we offer

  • An attractive and competitive salary, a good pension plan and 25 vacation days per year.

  • Great offsites and team events to strengthen the team and celebrate successes together.

  • A EUR 1000 learning and development budget to help you grow.

  • Autonomy to do your work the way that works best for you, whether you have children to care for or simply prefer early mornings.

  • An annual commuting subsidy.

Our interview process

Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:

  • Screening call: A short conversation to align on your motivation, career goals, and initial fit for the role.

  • Technical interview: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.

  • Onsite meeting (optional): You’ll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.

  • Final executive conversation: A discussion with a member of the executive team focused on long-term alignment, cultural fit, and shared expectations for impact.

About kaiko.ai

At kaiko.ai, we’re developing a multimodal clinical assistant for cancer care. 

Built on foundation models trained in close collaboration with academic R&D partners, the assistant’s interface helps cancer care teams quickly synthesize complex medical data, offering timely insights to support critical decisions for each patient.   

Currently in testing, we are working with teams across various cancer care specialties to develop and deploy the latest AI capabilities for clinical use: 

  • Distilling critical information from text, images, and molecular data

  • Linking modalities through multimodal foundation models for oncology

  • Facilitating diagnosis and treatment planning

We're refining our approach through close partnerships with leading institutions like the Netherlands Cancer Institute (NKI-AVL), merging clinical expertise with technological innovation. 

Born in Amsterdam in 2021, Kaiko has grown into a dynamic and multidisciplinary team spanning Amsterdam and Zurich. 

Industry: IT & Software
Company Size: 51-200 employees
Headquarters: Amsterdam, NL
Year Founded: 2021
Website: kaiko.ai