Job Description
Introduction to the Role
Transform AI into a true force multiplier for enterprise operations! This role advises on how machine learning and artificial intelligence platforms are run, automated, and improved across Azure and AWS to support critical scientific and business outcomes. The focus is on defining system architecture, monitoring, and automation so that operations itself becomes a benchmark for AI adoption.
This position manages a layered operations framework, spanning L1 runbook operators, L2 site reliability engineers (SREs), and an L3 product engineering interface, and establishes continuous improvement across all three tiers. The goal is to eliminate manual toil, increase first-line resolution, and free engineering time for higher-value work. Every incident must become a detailed procedure, an automated process, or a permanent fix.
This is not standard support; it is a leadership role for an engineering-minded operator who sets precise technical direction!
Accountabilities
- Operational Model Ownership: Lead and evolve the three-tier operations model for the AI/ML platform estate. Enforce operational readiness gates and run monthly reviews using clear metrics: L1 resolution rate, repeat incident rate, automation coverage, and toil budget compliance.
- Technical Direction: Establish instrumentation and alerting standards for a centralised observability layer (Datadog, New Relic, Grafana, Splunk). Guide the architecture of AI-augmented operations tooling, including conversational runbooks, and direct L2 SRE patch contributions to resolve root causes.
- Automation Strategy: Mandate that every incident yields a runbook, an automation, or a patch. Define acceptable toil thresholds and prioritise automation investments by incident frequency, resolution time, and blast radius.
- Team Leadership and Development: Direct the L2 SRE team aligned to cloud domains. Build a robust talent pipeline from L1 to L2, fostering a culture where SREs operate as engineers dedicated to operational excellence.
- Stakeholder Interface: Partner with product engineering to co-own post-mortems, negotiate handovers, and embed operational requirements into early architecture decisions. Communicate platform health, risks, and investments clearly to senior leadership using data-driven narratives.
Essential Skills and Experience
- Academic Background: BSc/MSc/PhD in Computer Science or a related analytical field.
- Operations Leadership: Demonstrable recent hands-on experience building and leading large-scale SRE or platform operations functions.
- Observability Expertise: Deep technical knowledge of platforms such as Datadog, New Relic, Grafana, or Splunk, covering dashboard development, alerting strategies, and telemetry pipeline architecture.
- Modern Standards: Solid understanding of OpenTelemetry, distributed tracing, and structured logging.
- Automation Delivery: Consistent track record of designing and implementing automation that materially reduces operational toil.
- Cloud Infrastructure: Strong grasp of Azure and/or AWS, including container orchestration, serverless architectures, and managed services.
- Incident Management: Proven ability to run post-mortem processes and translate findings into preventative engineering, alongside experience defining platform handover criteria.
- Technical Mentorship: Ability to provide precise technical direction on system instrumentation, alert triggers, and automation interventions while developing high-performing teams.
Desirable Skills and Experience
- AI/ML Operations: Application of AI/ML to operational challenges (e.g., intelligent alerting, automated diagnosis, conversational interfaces).
- Workload Management: Experience operating platforms serving AI/ML workloads such as LLM inference, model serving, and data pipelines.
- Industry Context: Familiarity with regulated pharmaceutical environments or the AstraZeneca technology estate.
- Frameworks: ITIL, SRE, or operational excellence certifications.
Working Environment
Bringing unexpected teams together sparks bold thinking. To facilitate this, we operate on a hybrid model, working an average of three days per week from the office while respecting individual flexibility. Join us in our unique and ambitious world!
Why AstraZeneca
Advanced data and AI are embedded in how medicines are discovered, developed, and delivered. This leadership role directly improves speed, reliability, and safety at a global scale. Work alongside deep specialists, apply cutting-edge techniques with real-world impact, and help shape how AI is used daily. The culture values kindness alongside ambition, encourages clear thinking, and provides the space to publish breakthroughs that advance the field.
Equal Opportunity
AstraZeneca is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Call to Action
Lead the charge to build an AI-augmented operations engine that frees engineers to innovate and accelerates patient impact. Apply today to start the conversation!
Date Posted
18-May-2026
Closing Date
04-Jun-2026
AstraZeneca embraces diversity and equality of opportunity. We are committed to building an inclusive and diverse team representing all backgrounds, with as wide a range of perspectives as possible, and harnessing industry-leading skills. We believe that the more inclusive we are, the better our work will be. We welcome and consider applications to join our team from all qualified candidates, regardless of their characteristics. We comply with all applicable laws and regulations on non-discrimination in employment (and recruitment), as well as work authorization and employment eligibility verification requirements.