Kaiko.AI Industry · Engineering

Senior ML Engineer (Evaluation)

CHF 150'000 – 170'000 / year

ZÜRICH

LARGE LANGUAGE MODEL · ML ENGINEER · MLOPS · LLM · AGENTIC

About kaiko.ai

kaiko is building a next-generation agentic clinical AI assistant that helps clinicians reason across patient data, guidelines, and diagnostics. Healthcare decisions are rarely made by a single person or from a single data source. kaiko’s assistant maintains longitudinal patient context across encounters, clinicians, and institutions, enabling collaboration, second opinions, and complex diagnostic workflows. The system is designed to operate safely in real clinical environments, with human oversight, auditability, and regulatory alignment at its core.

Our assistant core supports broadly applicable clinical tasks such as patient data navigation, guideline interaction, multimodal interaction (chat and voice), and care coordination. On top of this foundation, we are developing specialized diagnostic agents in areas such as oncology, radiology, and pathology.

We build in close collaboration with leading hospitals and research centers, including the Netherlands Cancer Institute (NKI). kaiko is a well-funded company with a growing international team, operating from Zurich and Amsterdam.

About the Role

kaiko’s Multimodal Large Language Model (MLLM) is trained on domain-specific, high-complexity medical data. Reaching clinical-grade performance demands a comprehensive evaluation stack that is fast, reliable, and deeply integrated with our model development loop. As a Senior ML Engineer (Evaluation), you will own the engineering stack that runs evaluations at scale, from efficient inference across a growing set of frontier models to asynchronous evaluation against a wide array of clinical benchmarks. You will automate the orchestration of our pipelines with a strong eye for observability and production-grade system organisation. You will work closely with ML researchers and the product team to translate research and clinical requirements into reliable, well-engineered eval signals.

As a Senior ML Engineer (Evaluation), you will

Own AI factory orchestration for evaluation workloads, with Dagster as the primary orchestration layer: design, operate, and mature the pipelines and workflows that run large-scale evaluation jobs, and extend automation across the stack wherever possible.
Maintain and evolve the inference services that power evaluation runs, including cluster- and actor-level resource management, ensuring correctness, reproducibility, and throughput as the model and benchmark zoo grows.
Ensure the functional integrity of the eval stack through rigorous testing and validation: verify model integrations, confirm expected behaviour across configurations, and support ML researchers in understanding model outputs.
Own Eval/MLOps end-to-end: service deployments, model registry and artifact versioning, eval database organisation, rollout and rollback procedures, and post-deployment observability.
Develop towards a technical lead role: set engineering direction, make architectural decisions, and support other engineers in execution.

You will be based in Zurich or Amsterdam, with the expectation of spending approximately 50% of your time in the office.

About you

Essential

Excellent Python skills and strong software engineering fundamentals: testing, modular design, CI/CD, code review, and monorepo tooling.
Proven experience building and operating ML inference services or MLOps infrastructure at scale, ideally for large language or multimodal models.
Hands-on experience with distributed compute and GPU workloads: familiarity with frameworks such as Ray, CUDA toolchains, and container runtimes (Docker/Kubernetes or equivalent).
Experience with model serving frameworks such as vLLM, TensorRT-LLM, Triton Inference Server, or similar.
Experience with workflow orchestration tools, with a preference for Dagster; ability to design reliable, maintainable pipeline DAGs.
Familiarity with the full deployment lifecycle, from containerisation and config management to observability, alerting, and incident response.
Ability to read and reason about model internals at a low level: tokenisation, numerical precision, tensor shapes, and inference-time behaviour.
Prior experience in the medical domain is not required, but a strong motivation to push the frontier of clinical AI through excellent engineering is.

Nice to have

Experience acting as a technical lead: setting direction on an engineering sub-system, making architectural trade-offs, and guiding other engineers.
Familiarity with eval frameworks (lm-eval-harness, OpenAI Evals, HF Evaluate) and benchmark integration pipelines.
Background in software-defined infrastructure, IaC tooling (Terraform, Pulumi), or cloud-native deployments on AWS/GCP/Azure.
Safety and reliability engineering mindset: experience with red-teaming, load testing, or quality practices for production AI systems.

We are excited to gather a broad range of perspectives in our team, as we believe it will help us build better products to support a broader set of people. If you’re excited about us but don’t fit every single qualification, we still encourage you to apply: we’ve had incredible team members join us who didn’t check every box.

Why kaiko

At kaiko, we believe the best ideas come from collaboration, ownership and ambition. We’ve built a team of international experts where your work has direct impact. Here’s what we value:

Ownership: You’ll have the autonomy to set your own goals, make critical decisions, and see the direct impact of your work.
Collaboration: You’ll approach disagreement with curiosity, build on common ground, and create solutions together.
Ambition: You’ll be surrounded by people who set high standards for themselves and others, who see obstacles as opportunities, and who are relentless in their work to create better outcomes for patients.

In addition, we offer

An attractive and competitive salary, a good pension plan and 25 vacation days per year.
Great offsites and team events to strengthen the team and celebrate successes together.
A EUR 1,000 learning and development budget to help you grow.
Autonomy to do your work the way that works best for you, whether you have children to care for or simply prefer early mornings.
An annual commuting subsidy.

Our interview process

Our interview process is designed to assess mutual fit across skills, motivation, and values. It typically includes the following steps:

Screening call: A short conversation to align on your motivation, career goals, and initial fit for the role.
Technical interview: A deep dive into your problem-solving approach through a technical challenge, case study, or role-specific scenario.
Onsite meeting (optional): You’ll meet team members across functions to explore collaboration dynamics, team fit, and day-to-day context.
Final executive conversation: A discussion with a member of the executive team focused on long-term alignment, cultural fit, and shared expectations for impact.
