Bank of New Zealand

Site Reliability Engineering Manager - Reliability Operations & Hygiene

Bank of New Zealand  •  New Zealand (Onsite)  •  11 hours ago

Job Description

Worker Type: Permanent

Here at BNZ, it's about more than just banking. We work together in an agile, energising environment to create innovative solutions through our promise "If you can imagine a better future, let's find a way."

We support wellbeing, flexible working and have a generous leave offering. There is the opportunity for growth, learning and career development. No two days are the same.

We have an opportunity for a Site Reliability Engineering Manager - Reliability Operations & Hygiene to join our Technology Site Reliability Engineering (SRE) team.

This is a leadership opportunity where you will be responsible for leading our major incident, IT change, problem management, and critical operational hygiene capabilities, applying SRE thinking to improve reliability, reduce toil, strengthen automation, and increase resilience across critical services.

Mō te Tūranga | About the Role

This position leads the operating rhythm, controls, and continuous improvement of reliability operations and operational hygiene, and works with service-owning teams to improve how production services are restored, changed, stabilised, and kept resilient over time.

We sat down with our Head of Tech – Site Reliability Engineering, and they let us know the following about the role.

What are five day-to-day tasks the person in this role will complete?

  • Provide day-to-day leadership for the Reliability Operations & Hygiene team (engineers/analysts and vendor resources), including workload prioritisation, coaching, quality of execution, and removal of blockers across incident, change, problem, and hygiene activities.
  • Lead and coordinate high-severity incidents when required, providing clear incident command, managing escalation and stakeholder communications, coordinating technical recovery, and ensuring post-incident reviews produce actionable, owned follow-ups.
  • Own the effectiveness of IT change and problem management disciplines by improving change risk assessment, reducing change-related incidents, driving quality root cause analysis, and ensuring repeat issues are tracked to remediation.
  • Review service health and reliability signals such as SLOs, error budgets, incident trends, failed changes, repeat incidents, alert quality, and operational risk indicators, then turn those insights into a prioritised improvement and automation backlog.
  • Lift critical operational hygiene across services, including incident readiness, runbook quality, monitoring and alert coverage, patching and backup discipline, configuration accuracy, dependency visibility, and other controls required to support resilient production operations.

What is the most exciting thing about this opportunity?
You’ll have the opportunity to reshape how BNZ manages some of the most important controls in day-to-day service operations. This role has direct influence over how major incidents are led, how change risk is managed, how repeat issues are eliminated, and how operational hygiene is lifted across critical services. When you improve these capabilities, the impact is tangible: fewer avoidable incidents, faster restoration, stronger resilience, and better experiences for customers and colleagues.

What is the most challenging thing about this opportunity?
This is a high-trust, high-accountability role in a complex, regulated environment where service ownership is distributed across multiple teams. You’ll need to influence teams you do not directly manage, raise the standard of execution across incident, change, problem, and hygiene disciplines, and make pragmatic trade-offs between delivery speed and reliability risk. You’ll also need to lead calmly during high-severity incidents while building the longer-term controls, automation, and learning loops that prevent the same issues from happening again.

What attributes will this person display in order to be successful in this role?

We are looking for calm, decisive leadership under pressure, with the ability to take command during major incidents while keeping technical responders and stakeholders aligned. Alongside this you will have:

  • A strong ownership mindset, with a focus on improving the system rather than repeatedly managing the symptoms of operational failure.
  • Evidence-based decision-making, using incident, change, problem, and service health data to prioritise work and make practical trade-offs.
  • Cross-functional influence and collaboration, building accountability with engineering, platforms, operations, architecture, risk, and security teams.
  • A blameless learning approach, paired with the discipline to ensure post-incident and problem actions are completed and lead to lasting improvement.
  • People leadership capability that builds a team culture of operational excellence, continuous improvement, and engineering-led reliability practices.

What specific skills would be beneficial?
The role requires strong SRE practice expertise, including SLIs, SLOs, error budgets, service health indicators, toil reduction, and the use of reliability data to prioritise engineering work. You will also bring:

  • Deep experience in major incident management, including incident command, technical coordination, stakeholder communications, escalation management, and facilitation of effective post-incident reviews.
  • Working knowledge of IT change and problem management, including change risk assessment, failed change analysis, root cause analysis methods, corrective action tracking, and reduction of repeat incidents.
  • Strong understanding of observability and operational readiness, including monitoring and alert strategy, logging and tracing, runbooks, production support readiness, and meaningful reliability reporting.
  • Experience improving operational hygiene through automation and control disciplines such as patching, backup assurance, configuration management, dependency mapping, certificate management, and readiness checks for critical services.
  • Practical automation capability, such as scripting, workflow automation, or infrastructure/platform automation, with a focus on eliminating repetitive manual work and improving consistency.
  • Experience operating in regulated environments and working with risk, control, compliance, and, where relevant, regulatory incident reporting obligations.

Nau Mai ki te Pēke o Aotearoa | Come to the Bank of New Zealand

This is an exciting opportunity to join us! We're bold thinkers who are taking brave steps to create a company that people want to work for, and customers want to bank with. If you're ready to join a fun organisation where we are proud of our culture and how we are helping New Zealanders to 'Find their way', then show your interest by submitting your application - we can't wait to read it.

"Ehara taku toa i te toa takitahi, engari he toa takitini" - "Success is not the work of an individual, but the work of many."

Closing Date: 23 June 2026

Applications will be reviewed regularly across the advertising period, but we do reserve the right to close applications early.

About Bank of New Zealand

Welcome to the official LinkedIn page of Bank of New Zealand, Te Pēke o Aotearoa. Since our inception over 160 years ago, BNZ has helped New Zealanders find a way. ‘Finding a way’ is at the heart of everything we do. It’s who we are. Today the bank employs over 5,000 people in New Zealand, working together to help navigate the people of Aotearoa towards a better future.

If you can imagine a better future, let’s find a way.

Industry: Finance & Insurance
Company Size: 5,001-10,000 employees
Headquarters: Auckland, NZ
Year Founded: 1861
Website: bnz.co.nz