Job Description

About FWD Group

FWD Group (1828.HK) is a pan-Asian life and health insurance business that serves approximately 40 million customers across 10 markets, including BRI Life in Indonesia. FWD’s customer-led and tech-enabled approach aims to deliver innovative propositions, easy-to-understand products and a simpler insurance experience. Established in 2013, the company operates in some of the fastest-growing insurance markets in the world with a vision of changing the way people feel about insurance. FWD Group is listed on the main board of the Hong Kong Stock Exchange under the stock code 1828.

For more information, please visit www.fwd.com

Purpose

Own the Group-wide strategy, policy, and outcomes for IT resilience, service reliability, and platform modernization across FWD Group’s infrastructure, cloud platforms, and core system applications
Set reliability objectives, govern error-budget policy, and hold decision rights over production change risk for NB and customer-facing services
Chair the Group Resilience Council and drive a federated SRE operating model across Group Office and Business Units (BUs)
Define the strategy for, design, enforce, and govern the Group resilience standard: every system must be highly available and have both a DR plan and a failover plan
Lead modernization of legacy platforms and production services by defining target-state architectures, resilience patterns, upgrade roadmaps, and remediation priorities to improve availability, scalability, security, and maintainability
IT SRE advises teams on remediating the root causes identified in P1/P2 RCAs, drives troubleshooting and analysis to identify where the code needs to be fixed, and oversees the entire troubleshooting process
Drive modernization through observability, automation, SRE practices, and engineering enablement, ensuring incident learnings translate into platform hardening, architectural simplification, and faster, safer delivery
Act as a multi-SME, generalist team with subject-matter experts across different areas (Security, Network, Cloud, Application, Infrastructure, etc.)
IT SRE manages and owns the new P1/P2 escalation protocol

Key accountabilities

Modernization

Define and drive the modernization roadmap for core system applications, including lifecycle management, upgrade strategy, technical debt reduction, platform simplification, and resilience-by-design requirements
Lead modernization reviews for core systems to assess architecture fitness, recoverability, scalability, supportability, and security, and translate incident learnings into prioritized remediation and refactoring plans
Establish modernization guardrails for core application estates covering observability, automation, release engineering, resilience patterns, decommission planning, and adoption of cloud-native or platform-standard capabilities where appropriate
Enterprise governance & policy: Develop and own Group standards covering SRE, resilience, modernization guardrails, error budgets, on-call and incident command frameworks, and production change controls; enforce release gates based on reliability and modernization risk
Reduce MTTR and incident recurrence; scale SLO coverage to ≥ 90% of critical services
Drive reliability-by-design reviews for NB and customer-facing system changes, preventing recurrence through architectural guardrails and automated release gates
Platform ownership: Product-own Observability & AIOps platforms and drive modernization enablers including telemetry standards, engineering guardrails, automated release controls, and reusable patterns for cloud-native adoption
Resilience: Approve DR tiers (RTO/RPO), lead chaos/DR exercises, and ensure cyber-resilience alignment with Security
Financials: Own SRE platform budget, FinOps targets, and vendor SLA outcomes
Org & talent: Build a global SRE leadership bench, run the SRE Academy, and operate a follow-the-sun model with healthy on-call
Stakeholder engagement: Prepare and present executive reporting to GMT on production reliability risk and major incidents
Provide leadership and decision-making across Group IT, local BU IT, local BU users, and the Group Digital & Data team by setting troubleshooting direction and approach
Provide thought leadership in performing root cause analysis and developing long-term prevention measures
Governance Framework: Develop and publish Group SRE Standards, including SLO/SLI definitions, error budgets, release gates, on-call health, and post-incident review policies; track and monitor that the standards are applied across Group IT, local BUs, Group Digital & Data, and other relevant stakeholders
Engage closely in technical leadership decision-making and collaborate with other key Technology leaders within Group and local BUs in setting the governance framework to enhance SRE and production reliability
Embedded Chapters: Create embedded SRE chapters within each BU and function (e.g., Group Digital), supported by a central platform SRE team, to ensure local ownership while maintaining global consistency
Maintain a comprehensive understanding of the many facets and disciplines that underlie SRE and production reliability across our platforms, systems, and applications in order to achieve our business goals. This requires a working understanding of business principles as they relate to IT solutions, regulatory compliance, and data security compliance.
Executive Dashboards and Escalation Protocols: Create executive dashboards and present to GMT and Excos on production risk and customer impact; collect feedback from GMT and Excos to drive continuous improvement
Design and own a tiered escalation protocol for P1/P2 incidents (CTO → Senior MD → CEO) to ensure visibility and urgency
Training and Culture: Operate the SRE Academy to build internal capability and promote a shared understanding of reliability principles
Promote blameless post-incident reviews (PIRs) to foster trust and continuous improvement

Qualifications / Experience

Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or related field
15+ years in large-scale infrastructure/platform/SRE, with 8–10+ years leading managers across multi-region operations
Proven track record leading enterprise resilience and modernization transformations, including legacy remediation, cloud adoption, platform standardization, and operating model change in multinational environments
Budget ownership and vendor/commercial leadership; experience reporting to GMT on production risk
Deep expertise with multi-cloud (AWS/Azure/GCP), Kubernetes, CI/CD, OpenTelemetry, and enterprise observability
Familiarity with ITIL 4, ISO 22301/27001, incident response (NIST SP 800-61), and audit frameworks relevant to insurance

Knowledge & Technical skills

SRE practices at scale (SLO/SLI design, error budgets, chaos engineering, capacity management)
Observability stacks and AIOps (e.g., Elastic, Datadog, Dynatrace; correlation and anomaly detection)
Strong understanding of cloud platforms, container orchestration, CI/CD, API and integration patterns, infrastructure-as-code, and modernization approaches such as refactoring, replatforming, and decomposing legacy services.
Understanding of IT operations in Insurance, BCP/DR, and regulatory expectations
Excellent communication and stakeholder management across Group and BUs; fluent in English
Chinese written proficiency preferred

About FWD Insurance

FWD Group (1828.HK) is a pan-Asian life and health insurance business that serves approximately 34 million customers across 10 markets, including BRI Life in Indonesia. FWD’s customer-led and tech-enabled approach aims to deliver innovative propositions, easy-to-understand products and a simpler insurance experience. Established in 2013, the company operates in some of the fastest-growing insurance markets in the world with a vision of changing the way people feel about insurance. FWD Group is listed on the main board of the Hong Kong Stock Exchange under the stock code 1828.

Industry

Finance & Insurance

Company Size

10,000+ employees

Headquarters

Quarry Bay, HK

Year Founded

2013

Website

fwd.com

Director, IT Resilience and Modernization Lead
