Job Description
Enabling safe and rewarding digital lives for genuine people, everywhere
We make it our mission to ensure more genuine people have digital access to opportunities, and businesses have access to more genuine people. Our technology draws on diverse and reliable data to create a single point of truth for identity and address verification.
With over 30 years of experience behind us, our team and technology are focused on enabling safe and rewarding digital lives for everyone. Regardless of age, location or background, genuine people everywhere should be able to digitally prove who they are and where they live.
About the team and role
Global Fraud Solutions
The team provides decision support solutions to address business objectives in risk prevention and fraud detection. We deliver software solutions and offer client support using our expertise and a client-focused approach.
Site Reliability Engineer
The SRE will build and operate the reliability, observability, and operational excellence infrastructure underpinning the GFS managed fraud detection platforms. You will work across deployment pipelines, cloud infrastructure, monitoring, and incident management, ensuring GBG can deliver on high-availability SLAs for banking and fintech customers who depend on real-time fraud detection at scale.
What you will do
- Design and operate the SRE practice for Managed offerings, including on-call processes, SLA frameworks, incident response playbooks, and post-incident review (PIR) processes.
- Build and maintain observability infrastructure: centralised logging (correlation IDs), metrics dashboards, distributed tracing, and alerting for the Predator/Instinct platform stack.
- Define and track SLOs (Service Level Objectives) and error budgets for real-time transaction processing pipelines, targeting high TPS and low round-trip latency.
- Manage cloud infrastructure provisioning and configuration using IaC tooling (Terraform, Helm), supporting both AWS/Azure cloud deployments and on-premises customer environments.
- Implement and maintain CI/CD pipelines for GFS solutions (Jenkins and similar tooling).
- Work with Engineering teams to ensure security and compliance readiness for Managed services — including PCI DSS, ISO 27001, SOC 1/2/3, PDPA/GDPR — in close coordination with InfoSec teams.
- Drive platform resilience improvements: high availability, auto-scaling, disaster recovery, backup/restore procedures, and chaos engineering practices.
- Manage secrets, certificate rotation, identity/access controls (OAuth/RBAC), and vulnerability management for the hosted environment.
- Support performance testing methodology and baseline establishment for our products.
- Contribute to the Architecture Review Committee (ARC) with SRE and operational perspectives on technology choices.
- Collaborate with engineering squads to embed reliability and DevSecOps practices across the SDLC.
Skills we’re looking for
- Minimum 5 years of solid hands-on experience in a Site Reliability, Platform Engineering, or DevOps role, ideally supporting mission-critical real-time processing systems in banking, payments, or fintech.
- Strong proficiency with cloud platforms (AWS preferred; Azure/GCP acceptable) including networking, compute, storage, and managed services.
- Deep expertise with containerisation and orchestration: Docker, Kubernetes (EKS/AKS/GKE), Helm, and associated tooling.
- Infrastructure as Code experience: Terraform (required), and familiarity with Ansible or Pulumi.
- Observability stack proficiency: Prometheus, Grafana, ELK/OpenSearch, Jaeger/Zipkin, or equivalent enterprise-grade tooling.
- CI/CD pipeline design and management: GitHub Actions, Jenkins, ArgoCD, or equivalent.
- Experience with security and compliance frameworks applicable to hosted financial services: PCI DSS, ISO 27001, SOC 1/2/3, GDPR/PDPA.
- Familiarity with database reliability practices for SQL Server, PostgreSQL, and Oracle — including replication, read replicas, and failover.
- Working knowledge of secrets management (HashiCorp Vault, AWS Secrets Manager) and zero-trust identity principles.
- Experience supporting real-time streaming or event-driven architectures (Kafka, RisingWave, or similar) in production environments.
- Scripting and automation proficiency: Python, Bash, or Go for operational tooling.
- Strong sense of operational ownership and accountability — comfortable being on-call and driving incidents to resolution.
- Excellent communication skills — able to produce clear incident reports, runbooks, and architecture documentation for both technical and executive audiences.
- Proactive mindset: identifies reliability risks before they become incidents and champions a culture of blameless post-mortems.
- Collaborative and effective when working with software engineers, product managers, and InfoSec teams.
- Continuous improvement orientation — always looking to reduce toil, automate repetitive tasks, and improve platform resilience.
- Flexible and adaptable — able to support a globally distributed product with customers across multiple time zones.
To find out more
As an equal opportunity employer, we are dedicated to creating a diverse and inclusive workplace where everyone feels valued and empowered. Please inform your GBG Talent Attraction Partner if you require any reasonable adjustments to the interview process.
To chat to the Talent Attraction team and find out more about our benefits and why we’re a great place to work, drop an email to behired@gbgplc.com and we’ll be in touch. You can also find out more about careers at GBG and check out our current opportunities at gbgplc.com/careers.