Role: ML Ops Engineer
Location: We operate a hybrid schedule, with 2-3 days a week in our office at Thorpe Park, Leeds.
Salary: £ DOE plus extensive benefits
Contract type: Permanent
Employment type: Full time
Working hours: We work on a core hours principle. Our core hours are 09:30 - 16:00; you can work around these to suit you!
Do you want to work for the nation’s largest online pharmacy, ensuring excellence for all our patients? We’re a market leader in the pharmacy world, with 25 years’ experience, helping over 1.8 million patients in England manage their NHS prescriptions from request through to delivery. We are Great Place to Work certified because we treat colleague experience as a top priority every day, and as a certified B Corp we also meet high standards of social and environmental responsibility. Our people are fundamental to our success and to achieving our vision of being a world-leading, patient-centric digital healthcare provider. We are committed to continuing to develop a positive, open and honest working environment for all.
Our tech teams keep us running 24/7 to make sure all our patients get world-class service. To support that, this role may include participation in an out-of-hours rota as required by the business. We operate a fair scheduling process and provide additional compensation for all on-call periods.
The ML Ops Engineer will drive the operation of production‑grade Machine Learning and LLM services on Azure, ensuring models run as reliable, scalable, and high‑performing systems. Owning the end‑to‑end MLOps/LLMOps lifecycle, the role leads on CI/CD, deployment automation, monitoring, and incident response.
Working closely with Data Science, this role turns models into robust production services, bringing strong governance, observability, and continuous optimisation to ensure fast, safe, and efficient delivery at scale.
Why you’ll love working with us
We believe great people deserve great support. That’s why we offer a benefits package designed to look after your health, finances, career and life outside work.
Financial security & rewards
· Competitive contributory pension
· Occupational sick pay
· Long-service awards and refer-a-friend bonuses
· Professional registration fees covered (GPhC, NMC, CIPD and more)
· Cycle to Work and Green Car schemes (subject to eligibility)
Family-friendly
· Enhanced maternity and paternity pay
· Flexible hybrid working to help balance work and home life
Health & wellbeing
· Private healthcare insurance at discounted rates (Aviva)
· Employee Assistance Programme and in-house mental health support
· Access to discounted gym memberships via Blue Light Card and benefits schemes
· Regular health and wellbeing initiatives
Career growth
· Strong commitment to CPD, training and professional development
Time off & flexibility
· 25 days’ annual leave, increasing with service
· Buy and sell holiday scheme
Everyday perks & exclusive discounts
· Blue Light Card and employee discount platform
· Exclusive discounts at The Springs, Leeds
· 25% off health & beauty purchases
· 25% off Pharmacy2U Private Online Doctor services
Culture & community
· Regular social events throughout the year
What you’ll be doing
Production Deployment & Release Engineering
· Design and operate CI/CD pipelines for ML models and LLM prompt‑flows, covering build, test, validation, deployment, and rollback
· Own model registration and promotion across environments, ensuring traceability, governance, and auditability
· Implement safe deployment strategies (e.g. blue/green, canary, champion/challenger)
· Package and deploy containerised inference services and batch pipelines, ensuring repeatability and rapid rollback
Reliability Engineering (Day 2 Operations)
· Run ML and LLM services as production‑grade systems, defining SLOs/SLIs, dashboards, and alerting
· Lead incident response for runtime issues, including triage, mitigation, recovery, and post‑incident reviews
· Develop and maintain operational runbooks covering restart, rollback, secret rotation, and safe‑mode scenarios
· Improve service resilience and reduce MTTR through automation (e.g. self‑healing, retries, fallbacks, circuit breakers)
Observability (Service, Data, Model & Cost)
· Implement monitoring for availability, latency, errors, resource usage, and job performance
· Monitor data quality including freshness, volume, completeness, schema drift, and distribution changes
· Monitor model performance, including drift and prediction distribution shifts, and track accuracy where labels exist
· Instrument LLM services for token usage, latency, and safety signals, with clear visibility into cost, quotas, and risks
LLMOps: Lifecycle, Quality & Safety
· Manage prompts and workflows as code, including versioning, code reviews, and automated regression testing
· Own production configuration for LLM deployments, including model updates, limits, and safeguards
· Partner with Data Science and Security to ensure robust safety practices, including PII protection and prompt‑injection testing
Security, Privacy & Governance
· Implement secure access controls, identity management, and secrets handling aligned to best practice
· Support production readiness through documentation, monitoring plans, cost models, and audit evidence
· Ensure all changes follow structured governance, with clear traceability and reproducibility
Who are we looking for?
· Strong Python engineering skills, with experience in ML frameworks such as scikit‑learn, PyTorch, or TensorFlow, and familiarity with experiment tracking
· Comfortable working in regulated environments, with an understanding of privacy, auditability, change control, and handling sensitive data
· Strong DevOps/SRE background, including CI/CD, Infrastructure as Code, monitoring and alerting, incident management, and reliability engineering
· Hands‑on experience with containerisation using tools such as Docker and Kubernetes (e.g. AKS), including debugging, performance tuning, and working with container registries
· Experience working with Azure, ideally including Azure Machine Learning (pipelines, registries, online and batch endpoints) and Azure Monitor or Log Analytics
· Experience operationalising ML pipelines, including training, batch scoring, feature engineering workflows, and preventing training‑serving skew
· Experience implementing safe deployment practices such as blue/green or canary releases, supported by automated validation
· Understanding of data contracts, schema evolution, and data quality practices, with the ability to troubleshoot data drift and missing features
What happens next?
Please click apply and if we think you are a good match, we will be in touch to arrange an interview.
Applicants must prove they have the right to live in the UK.
All successful applicants will be required to undergo a DBS check.
Unsolicited agency applications will be treated as a gift.
