Job Description
About FWD Group
FWD Group (1828.HK) is a pan-Asian life and health insurance business that serves approximately 40 million customers across 10 markets, including BRI Life in Indonesia. FWD’s customer-led and tech-enabled approach aims to deliver innovative propositions, easy-to-understand products and a simpler insurance experience. Established in 2013, the company operates in some of the fastest-growing insurance markets in the world with a vision of changing the way people feel about insurance. FWD Group is listed on the main board of the Hong Kong Stock Exchange under the stock code 1828.
For more information, please visit www.fwd.com
Purpose
- Own the Group-wide strategy, policy, and outcomes for IT resilience, service reliability, and platform modernization across FWD Group’s infrastructure, cloud platforms, and core system applications
- Set reliability objectives, govern error-budget policy, and hold decision rights over production change risk for NB and customer-facing services
- Chair the Group Resilience Council and drive a federated SRE operating model across Group Office and Business Units (BUs)
- Define, design, enforce, and govern the Group resilience standard: all systems must be highly available, with a DR plan and a failover plan
- Lead modernization of legacy platforms and production services by defining target-state architectures, resilience patterns, upgrade roadmaps, and remediation priorities to improve availability, scalability, security, and maintainability
- IT SRE advises teams on how to resolve the issues identified in P1/P2 RCAs, drives troubleshooting and analysis to identify where code needs to be fixed, and oversees the entire troubleshooting process
- Drive modernization through observability, automation, SRE practices, and engineering enablement, ensuring incident learnings translate into platform hardening, architectural simplification, and faster, safer delivery
- Act as a multi-SME / generalist team, with SMEs across different areas (Security, Network, Cloud, Application, Infrastructure, etc.)
- IT SRE manages and owns the new P1/P2 escalation protocol
Key accountabilities
Modernization
- Define and drive the modernization roadmap for core system applications, including lifecycle management, upgrade strategy, technical debt reduction, platform simplification, and resilience-by-design requirements
- Lead modernization reviews for core systems to assess architecture fitness, recoverability, scalability, supportability, and security, and translate incident learnings into prioritized remediation and refactoring plans
- Establish modernization guardrails for core application estates covering observability, automation, release engineering, resilience patterns, decommission planning, and adoption of cloud-native or platform-standard capabilities where appropriate
- Enterprise governance & policy: Develop and own Group SRE standards covering resilience, modernization guardrails, error budget policy, on-call and incident command frameworks, and production change controls; enforce release gates based on reliability and modernization risk
- Reduce MTTR and incident recurrence; scale SLO coverage to ≥ 90% of critical services
- Drive reliability-by-design reviews for NB and customer-facing system changes, preventing recurrence through architectural guardrails and automated release gates
- Platform ownership: Product-own Observability & AIOps platforms and drive modernization enablers including telemetry standards, engineering guardrails, automated release controls, and reusable patterns for cloud-native adoption
- Resilience: Approve DR tiers (RTO/RPO), lead chaos/DR exercises, and ensure cyber-resilience alignment with Security
- Financials: Own SRE platform budget, FinOps targets, and vendor SLA outcomes
- Org & talent: Build a global SRE leadership bench, run the SRE Academy, and operate a follow-the-sun model with healthy on-call
- Stakeholder engagement: Prepare and present executive reporting to GMT on production reliability risk and major incidents
- Provide leadership and decision-making across Group IT, local BU IT, local BU users, and the Group Digital & Data team by setting troubleshooting direction and approach
- Provide thought leadership in performing root cause analysis and developing long-term prevention measures
- Governance framework: Develop and publish Group SRE Standards, including SLO/SLI definitions, error budgets, release gates, on-call health, and post-incident review policies; track and monitor that the standards are implemented across Group IT, local BUs, Group Digital & Data, and other relevant stakeholders
- Engage closely in technical leadership decision-making and collaborate with other key Technology leaders across Group and local BUs to set the governance framework for enhancing SRE and production reliability
- Embedded chapters: Create embedded SRE chapters within each BU and function (e.g., Group Digital), supported by a central platform SRE team, to ensure local ownership while maintaining global consistency
- Maintain a comprehensive understanding of the many facets and disciplines that underlie SRE and production reliability across our platforms, systems, and applications in support of our business goals. This requires an understanding of business principles as they relate to IT solutions, regulatory compliance, and data security compliance
- Executive dashboards and escalation protocols: Create executive dashboards and present to GMT and Excos on production risk and customer impact; collect their feedback to drive continuous improvement
- Design and own a tiered escalation protocol for P1/P2 incidents (CTO → Senior MD → CEO) to ensure visibility and urgency
- Training and Culture: Operate the SRE Academy to build internal capability and promote a shared understanding of reliability principles
- Promote blameless post-incident reviews (PIRs) to foster trust and continuous improvement
Qualifications / Experience
- Bachelor’s or Master’s degree in Computer Science, Information Technology, Engineering, or related field
- 15+ years in large-scale infrastructure/platform/SRE, with 8–10+ years leading managers across multi-region operations
- Proven track record leading enterprise resilience and modernization transformations, including legacy remediation, cloud adoption, platform standardization, and operating model change in multinational environments
- Budget ownership and vendor/commercial leadership; experience reporting to GMT on production risk
- Deep expertise with multi-cloud (AWS/Azure/GCP), Kubernetes, CI/CD, OpenTelemetry, and enterprise observability
- Familiarity with ITIL 4, ISO 22301/27001, incident response (NIST SP 800-61), and audit frameworks relevant to insurance
Knowledge & Technical skills
- SRE practices at scale (SLO/SLI design, error budgets, chaos engineering, capacity management)
- Observability stacks and AIOps (e.g., Elastic, Datadog, Dynatrace; correlation and anomaly detection)
- Strong understanding of cloud platforms, container orchestration, CI/CD, API and integration patterns, infrastructure-as-code, and modernization approaches such as refactoring, replatforming, and decomposing legacy services
- Understanding of IT operations in Insurance, BCP/DR, and regulatory expectations
- Excellent communication and stakeholder management across Group and BUs; fluent in English
- Chinese written proficiency preferred