Photon

SRE Lead | Guadalajara, Mexico

Photon  •  Mexico (Onsite)  •  5 hours ago

Job Description

Site Reliability Engineers are responsible for ensuring the availability, reliability, scalability, and performance of the firm’s most critical, customer-facing microservices that power all eCommerce channels. This role applies Google-inspired SRE principles to balance feature velocity and system reliability using Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.

The role combines software engineering, cloud engineering, automation, and production operations, with a strong emphasis on building systems that are observable, resilient, and operable by default.

Primary Responsibilities

Define, implement, and own SLIs, SLOs, and error budgets for critical microservices in collaboration with product and engineering teams.

Use error budgets to influence release decisions, prioritize reliability work, and manage operational risk.

Design and maintain observability platforms including metrics, logs, traces, and real-time telemetry.

Track, manage, and reduce operational toil by converting repetitive operational work into Jira stories and epics with clear ownership and measurable outcomes.

Design, implement, and validate resiliency mechanisms such as graceful degradation, redundancy, automated failover, and disaster recovery.

Lead incident response, act as an escalation point for high-severity incidents, and drive blameless postmortems.

Capture incident action items and reliability improvements in Jira, ensuring closure, accountability, and continuous improvement.

Partner with scrum teams to improve reliability through release readiness reviews, production change validation, and testing strategies.

Perform deep root cause analysis, debugging, and performance tuning across distributed systems.

Promote shift-left reliability by embedding operability, monitoring, and failure testing early in the SDLC.

Drive continuous improvement through automation, self-healing systems, chaos engineering, and capacity planning.

Maintain runbooks, playbooks, and knowledge repositories, linking documentation to Jira tasks to reduce MTTR.

Provide technical leadership and mentoring to junior SREs and engineers.

Collaborate with global, distributed teams, leveraging Jira for transparent planning, dependency tracking, and execution.

Core Competencies & Accomplishments

6+ years of experience in SRE, software engineering, or production operations supporting large-scale eCommerce platforms.

Hands-on experience with Java/J2EE-based distributed systems. React experience is a plus.

Proven ability to design and operate systems using SLO-driven reliability models.

Experience defining and measuring SLIs (availability, latency, error rates, throughput, saturation).

Good understanding of NoSQL technologies and RDBMS, including the ability to write queries to retrieve data from a database.

Experience deploying and operating services on cloud platforms (AWS, Azure, or Google Cloud).

Expertise with observability, APM, and caching tools (Dynatrace, Splunk, ELK, Akamai, QuantumMetric/Tealeaf, etc.).

Strong experience using Jira for backlog management, incident follow-ups, toil reduction tracking, and cross-team coordination.

Ability to independently own services and drive reliability initiatives end-to-end.

Strong communication skills and ability to influence engineering and product teams.

Experience serving in an on-call rotation and handling critical and high-severity incidents.


Desired Skills

Experience building and operating microservices architectures using Spring Boot, Groovy, React, or similar.

Strong understanding of CI/CD pipelines, release automation, and progressive delivery.

Experience with eCommerce domains such as Catalog, Customer Data, and Order Management.

Familiarity with search platforms (Endeca, Solr, Lucene, Elasticsearch).

Proficiency in scripting and automation (Python, Bash, Ruby, Perl, PowerShell).

Experience with ITSM tools integrated with Jira workflows.

Exposure to capacity planning, load testing, and chaos engineering.

About Photon

Photon, a global leader in AI and digital solutions, helps clients accelerate AI adoption and embrace DigitalHyper-expansion® to ‘make tomorrow happen today’. We work with 40% of the Fortune 100, enabling them to stay agile and future-ready in an era of converging digital and AI boundaries. Powering billions of touchpoints a day, Photon combines AI management, digital innovation, product design thinking, and engineering excellence to drive lasting transformation for F500 clients. We employ several thousand people across dozens of countries. Learn more at www.photon.com

Industry: IT & Software
Company Size: 5,001-10,000 employees
Headquarters: London, GB
Year Founded: 2007