Job Description
As our SRE charter continues to evolve, this role demands strong hands-on ownership of production reliability and troubleshooting, coupled with advanced capabilities in AI- and agentic-driven automation and performance engineering.
The Site Reliability Engineer will play a critical role in ensuring the reliability, scalability, performance, and operational excellence of our platforms. The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations, while also driving performance testing practices to validate system resilience under load.
This is a hybrid role based at our Plano, TX office. Candidates must be willing and able to work in the office 3 days per week.
Applicants must be currently authorized to work in the United States on a full-time basis and must not require sponsorship for employment visa status now or in the future.
A DAY IN THE LIFE
In this role, you will…
- Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
- Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
- Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
- Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
- Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
- Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms
- Contribute to evolving SRE standards, tooling, operational processes, and the team knowledge base
Responsibilities
Reliability Engineering & Production Ownership
- Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
- Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
- Define, measure, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets aligned with business outcomes
- Drive proactive reliability improvements based on operational insights, failure mode analysis, and capacity planning
- Participate in on-call rotations and take real-time ownership during production incidents
Platform & Automation Engineering
- Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
- Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
- Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
- Implement self-healing mechanisms, automated remediation workflows, and runbook automation
- Manage and optimize API lifecycle and traffic management using Gravitee API Gateway
- Design and implement durable, fault-tolerant workflows and microservice orchestration patterns using Temporal
- Administer and tune PostgreSQL databases for reliability, performance, and high availability
- Partner with application and platform teams to improve service operability, deployment safety, and change management
Performance Testing & Load Engineering
- Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
- Build and maintain performance test scripts and virtual user scenarios using Micro Focus LoadRunner and VuGen (Virtual User Generator)
- Analyze performance test results to identify bottlenecks, regressions, and scalability limits; produce clear reports with actionable recommendations
- Integrate performance testing into CI/CD pipelines to enable continuous performance validation and shift-left testing practices
- Establish and monitor performance baselines, benchmarks, and SLAs across critical service endpoints and user journeys
- Collaborate with development and architecture teams to resolve performance issues and optimize system throughput, latency, and resource utilization
AI / Agentic Engineering (Azure Focus)
- Design and implement AI-driven and agentic systems to enhance operational workflows and intelligent decision-making
- Build intelligent automation for operational use cases, including:
  - Incident triage, enrichment, and automated escalation
  - Alert correlation, deduplication, and noise reduction
  - Automated diagnosis and remediation of recurring failures
- Leverage Azure AI services (Azure OpenAI, Cognitive Services, Azure ML) for operational intelligence and predictive insights
- Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms
- Ensure safe, reliable, and observable operation of AI-powered systems in production, including guardrails, fallback mechanisms, and audit trails
Collaboration & Technical Leadership
- Act as a reliability, performance, and automation champion across engineering teams
- Mentor junior SREs and influence adoption of best practices in reliability, observability, and performance engineering
- Contribute to evolving SRE standards, tooling, operational processes, and the team knowledge base
- Participate in architecture reviews and provide guidance on non-functional requirements (reliability, scalability, performance)
Qualifications
Core SRE Skills
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
- Strong hands-on experience in production troubleshooting of distributed systems at scale
- Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
- Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
- Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
- Proficiency in one or more programming languages: Python, Go, Java, or equivalent
- Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)
Observability & Monitoring
- Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
- Experience with alerting strategies, SLI/SLO-based monitoring, and on-call incident management
Performance Testing & Load Engineering
- Proven experience designing and executing performance and load testing for large-scale distributed applications
- Hands-on proficiency with Micro Focus LoadRunner and VuGen for scripting virtual user scenarios, parameterization, correlation, and result analysis
- Strong understanding of performance testing methodologies: load testing, stress testing, endurance/soak testing, spike testing, and capacity planning
- Ability to analyze performance metrics (throughput, response time, error rate, resource utilization) and translate findings into engineering actions
- Experience integrating performance tests into automated CI/CD pipelines
Platform & Middleware
- Experience with Gravitee or equivalent API gateway platforms for traffic management, rate limiting, and API lifecycle governance
- Hands-on experience with Temporal for workflow orchestration, durable execution, and distributed task management
- Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning
AI / Agentic Systems
- Hands-on experience building or integrating AI-powered automation in production environments
- Experience with agent-based systems, LLM-powered workflows, Retrieval-Augmented Generation (RAG), or intelligent assistants
- Familiarity with Azure-based AI and ML services (Azure OpenAI, Cognitive Services, Azure ML)
- Understanding of the reliability, safety, observability, and operational challenges of running AI systems in production