Job Description

What is special about Lighthouse?

Lighthouse is built on a foundation of unique, compassionate, highly driven individuals. We elevate the strengths and talents of those around us while leveraging opportunities for growth. We offer the experience of solving complex problems while continuing to grow multiple facets of your career. Lighthouse is where innovation meets support and where collaboration is the key ingredient to success. We grow together and are stronger together.

What’s unique about this role?

The Senior Cloud Site Reliability Engineer (Senior Cloud SRE) is responsible for ensuring the reliability, scalability, availability, performance, security, and operational excellence of Lighthouse’s cloud platforms and critical product infrastructure.

This role combines software engineering, cloud engineering, automation, observability, and operational governance practices to build highly resilient and self-healing platforms across hybrid and cloud-native environments. The ideal candidate will drive SRE best practices, improve service reliability through automation, establish observability standards, and partner closely with Engineering, Product, Security, DBA, and DevEx teams to improve operational maturity across the organization.

The role requires deep expertise in cloud infrastructure, Kubernetes, DevOps/SRE principles, telemetry, incident management, monitoring, and automation, along with strong collaboration and communication skills.

What will this person do?

Site Reliability Engineering & Operational Excellence

Drive and implement Site Reliability Engineering (SRE) best practices across cloud platforms and services.
Define, maintain, and improve:
- Service Level Indicators (SLIs)
- Service Level Objectives (SLOs)
- Service Level Agreements (SLAs)
- Error Budgets
Improve service reliability, resiliency, scalability, and operational efficiency.
Establish operational standards, reliability governance, and production readiness practices.
Conduct Root Cause Analysis (RCA), postmortems, and reliability improvement initiatives.
Participate in on-call rotations, incident management, and major incident resolution activities.
Continuously improve operational processes to reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).

Observability, Monitoring & Telemetry

Design, implement, and maintain enterprise observability and telemetry platforms.
Build operational dashboards, reliability scorecards, and service health monitoring solutions.
Configure proactive alerting, anomaly detection, and incident correlation mechanisms.
Implement centralized monitoring and telemetry using:
- Grafana
- Prometheus
- Azure Monitor
- Log Analytics
- ELK Stack / Elasticsearch
- Power BI dashboards
Develop actionable operational metrics and telemetry reporting for engineering and leadership teams.
Enhance visibility into infrastructure, application, Kubernetes, and platform health.

Automation & Auto-Healing Engineering

Drive automation-first operational practices across infrastructure and platform services.
Develop Infrastructure-as-Code (IaC) solutions using:
- Terraform
- ARM/Bicep
- Ansible
Build operational automation scripts using:
- Python
- Bash
- PowerShell
Develop self-healing and auto-remediation capabilities for recurring operational incidents.
Automate infrastructure provisioning, monitoring, scaling, backup, recovery, and deployment workflows.
Reduce manual operational effort and improve engineering productivity through intelligent automation.

Collaboration & Engineering Partnership

Collaborate closely with:
- Cloud Engineering teams
- Product Engineering teams
- DevEx teams
- Security teams
- DBA teams
- Operations teams
Support engineering teams in improving production readiness and operational maturity.
Contribute to continuous improvement initiatives, reliability reviews, and operational excellence programs.

Bring your passion and together we will shine. It would also be great if you had the following:

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience/certification).
Knowledge of Python, scripting, or Infrastructure-as-Code tools (e.g., Terraform, Ansible, ARM/Bicep).
Experience managing cloud platforms (e.g., Azure, AKS, Pivotal Cloud Foundry, or equivalent).
Strong understanding of Kubernetes and containerization concepts.
Experience with application packaging, deployment automation, and release management.
Solid knowledge of relational databases (MS-SQL) and exposure to NoSQL technologies (e.g., Redis, Elasticsearch, MongoDB).
Experience with CI/CD tools (Azure DevOps, Jenkins, GitHub Actions, or similar).
Familiarity with monitoring and logging tools (Grafana, ELK Stack, Prometheus, Power BI, etc.).
Proficiency with Git and modern branching/merging workflows.
Strong Linux administration and troubleshooting skills.
Excellent problem-solving, communication, and teamwork skills.

Work Environment and Physical Demands

Duties are performed in a typical office environment while at a desk or computer table.
Duties require the ability to use a computer, communicate over the telephone, and read printed material, in a quiet and professional setting.
Duties may require being on call periodically and working outside normal working hours (evenings and weekends).

Lighthouse celebrates and thrives on diversity and is an Equal Opportunity Employer. We hire, train, and promote regardless of race, religion, color, national origin, sex, disability, age, veteran status, and other protected status as required by applicable law. We welcome any talents and contributions you can bring to the team and are deeply committed to growing an environment where everyone can feel safe, is respected, and can show up as themselves. Come as you are!

As required by applicable pay transparency laws, Lighthouse complies with compensation disclosure requirements for roles that may be hired in locations under these requirements. Your actual salary may depend on a range of factors, including your specific skills and experience, geographic location, or other relevant considerations. The salary range for this position may be tailored to be lower or higher in different talent markets.

This role will be eligible to participate in an annual bonus or incentive program.

As a trailblazer and catalyst for change, Lighthouse rises to each opportunity to help our clients and our people do what they do best: shine.

This position will work for and be employed by Lighthouse's India subsidiary, which is an independent company located in India.

About Lighthouse

Since our inception as a local document copy shop in 1995, Lighthouse has evolved with the legal technology landscape, anticipating the trends that shape legal practices, information management, and complex eDiscovery.

Over time, we’ve built our own technology to increase efficiency and address gaps. We’ve brought leading providers Discovia and H5 under the Lighthouse umbrella to expand our investigations and review capabilities. We also have lasting partnerships with leading vendors like Relativity, Brainspace, and Nuix, creating best-in-class workflows boosted by exclusive Lighthouse features. Most importantly, Lighthouse combines innovation, expertise, and a commitment to partnership in every service we provide.

Industry

Legal & Compliance

Company Size

1,001-5,000 employees

Headquarters

Seattle, WA

Year Founded

1995

Website

lighthouseglobal.com
