kaiko.ai

ML Platform Engineer

kaiko.ai  •  Zürich, CH (Hybrid)  •  3 hours ago

Job Description

About kaiko.ai

Kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics.

Healthcare decisions are rarely made by a single person or from a single data source. kaiko's assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.

Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.

We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.

About the role

As an ML Platform Engineer, you'll help architect, scale, and evolve the infrastructure that powers kaiko's foundation-model training and serving across the full ML lifecycle: from compute orchestration that runs large-scale training jobs, through experiment tracking and model registry, to GPU-backed serving in production.

The Data & AI Platform team owns the foundation underneath kaiko's training and serving workloads: the orchestration and distributed compute layer (Dagster + Ray on Kubernetes), the storage layer (Hammerspace), and the observability that keeps them reliable. We support both research, with large-scale training workloads on open-weight MoE models in the hundreds-of-billions to trillion-parameter range, and product, with the engineering teams building kaiko's clinical applications.

Today, our platform is built around large-scale training. The next frontier is closing the loop: building the model lifecycle layer (experiment tracking, registry, versioning, and GPU-backed serving) that lets our trained models move reliably into product. You'll be part of the team that gets us there.

You'll work closely with both research and product engineering to understand what the platform needs to deliver, and you'll own the engineering of how it gets delivered. You'll contribute to the platform's evolution across both training and serving, helping translate team needs into reliable, scalable infrastructure.

You will be based in either the Netherlands or Switzerland, and are expected to spend at least 50% of your time at the office.


Some areas of responsibility
  • Build and evolve the infrastructure that makes ML development fast, reliable, and observable: from IaC to CI/CD to Kubernetes-based workload orchestration.

  • Contribute to the compute orchestration layer: help scale GPU workloads across heterogeneous on-prem and cloud clusters using Kubernetes, Ray, and advanced GPU scheduling technologies such as KAI Scheduler.

  • Support hybrid and multi-cloud strategies that balance performance, compliance, and cost, including the Hammerspace storage rollout.

  • Own and evolve parts of the AI Factory: the Dagster-based orchestrator and its Ray integration that training jobs depend on.

  • Build and maintain the model lifecycle layer: experiment tracking, model registry, versioning, and GPU-backed serving infrastructure, so models trained on our clusters move reliably into production.

  • Partner with both research and product engineering to understand platform requirements and translate them into shared infrastructure that works across use cases.

  • Bring engineering rigour to a fast-moving stack: lineage, reproducibility, ownership boundaries, and documentation that lets the team move quickly without losing track.

About you
  • 2-5 years of experience in a production ML platform engineering or MLOps role.

  • Hands-on experience with Kubernetes, Helm, Terraform, Docker, and CI/CD tooling (ArgoCD, GitHub Actions, or comparable). Comfortable with modern observability stacks (Grafana, Prometheus, Loki, or similar).

  • Compute orchestration at scale. Production experience scheduling GPU workloads with Kubernetes, Ray, or a comparable system, or strong motivation to grow into this quickly.

  • ML infrastructure exposure. Hands-on experience with Linux and NVIDIA GPU environments, including multi-node training stacks and the networking that connects them (InfiniBand or comparable). You have a feel for how researchers and engineers use the platform, not just how it's built.

  • Familiarity with the full ML workflow: training runs, experiment tracking, model registry and versioning (MLflow or equivalent), and model serving. You understand how models move from training to production.

  • Solid software engineering. Python as a baseline; openness to picking up Go or other programming languages where the platform demands it.

  • Collaborative by default. You work across team boundaries (research, product, data engineering) and design interfaces that serve multiple stakeholders without requiring everyone to understand the underlying plumbing.

Nice to have:
  • Direct experience supporting large-scale foundation-model training or inference, with a working sense of how a run behaves from the platform's perspective: MFU, GPU utilisation, communication overhead, and dataloader stalls on the training side; serving frameworks (vLLM, Triton, TorchServe), KV cache management, and batching strategies on the inference side.

  • Familiarity with lower-level GPU communication and I/O concepts: RDMA, GPUDirect Storage (GDS), GPUDirect RDMA, NCCL, and how these affect training throughput and cluster design.

  • Experience with Kubernetes-native scheduling stacks for accelerator-heavy workloads, such as Volcano, KAI Scheduler, YuniKorn, or similar, so scheduler tradeoffs are understood rather than defaulted to.

  • High-performance (parallel) filesystems that feed GPUs with high throughput and low latency (Hammerspace, Ceph, WEKA, VAST).

  • On-prem and hybrid GPU environments, especially in regulated settings (healthcare, finance, public sector).

  • Exposure to MoE architectures or large-scale distributed training, useful for anticipating what future workloads will demand of the platform.

We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box!

Why kaiko

At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:

  • Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.

  • Collaboration: You’ll approach disagreement with curiosity, build on common ground, and create solutions together.

  • Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.


In addition, we offer:
  • An attractive and competitive salary, a good pension plan and 25 vacation days per year.

  • Great offsites and team events to strengthen the team and celebrate successes together.

  • A EUR 1000 learning and development budget to help you grow.

  • Autonomy to do your work the way that works best for you, whether you have children to care for or simply prefer early mornings.

  • An annual commuting subsidy.

Our interview process

Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:

  • Screening call: A short conversation to align on your motivation, career goals, and initial fit for the role.

  • Coding assessment & debrief: You'll complete a time-limited coding exercise, started at a time of your choosing. Afterwards, you'll join a live session with members of our team to walk through your solution, explain your reasoning, and discuss any trade-offs you made.

  • Technical interview: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.

  • Onsite meeting (optional): You’ll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.

  • Final executive conversation: A discussion with a member of the executive team focused on long-term alignment, cultural fit, and shared expectations for impact.

About kaiko.ai

At kaiko.ai, we’re developing a multimodal clinical assistant for cancer care. 

Built on foundation models trained in close collaboration with academic R&D partners, the assistant’s interface helps cancer care teams quickly synthesize complex medical data, offering timely insights to support critical decisions for each patient.   

Currently in testing, we are working with teams across various cancer care specialties to develop and deploy the latest AI capabilities for clinical use: 

  • Distilling critical information from text, images, and molecular data

  • Linking modalities through multimodal foundation models for oncology

  • Facilitating diagnosis and treatment planning

We're refining our approach through close partnerships with leading institutions like the Netherlands Cancer Institute (NKI-AVL), merging clinical expertise with technological innovation. 

Born in Amsterdam in 2021, kaiko has grown into a dynamic and multidisciplinary team spanning Amsterdam and Zurich.

Industry: IT & Software
Company Size: 51-200 employees
Headquarters: Amsterdam, NL
Year Founded: 2021
Website: kaiko.ai