ServiceLink

Site Reliability Engineer, AI & Agentic Systems

ServiceLink  •  Plano, TX (Hybrid)  •  4 hours ago

Job Description

As our SRE charter continues to evolve, this role demands strong hands-on ownership of production reliability and troubleshooting, coupled with advanced capabilities in AI- and agentic-driven automation and performance engineering.

The Site Reliability Engineer will play a critical role in ensuring reliability, scalability, performance, and operational excellence of our platforms. The ideal candidate will leverage Azure-native AI services and agentic systems to reduce toil, improve incident response, and enable intelligent operations—while also driving performance testing practices to validate system resilience under load.

This is a hybrid role located at our Plano, TX office. Candidates must be willing and able to work in-office 3 days per week in Plano, TX.

Applicants must be currently authorized to work in the United States on a full-time basis and must not require sponsorship for employment visa status now or in the future.

A DAY IN THE LIFE

In this role, you will…

  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms
  • Contribute to evolving SRE standards, tooling, operational processes, and the knowledge base

Responsibilities

Reliability Engineering & Production Ownership

  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Define, measure, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets aligned with business outcomes
  • Drive proactive reliability improvements based on operational insights, failure mode analysis, and capacity planning
  • Participate in on-call rotations and take real-time ownership during production incidents

Platform & Automation Engineering

  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
  • Implement self-healing mechanisms, automated remediation workflows, and runbook automation
  • Manage and optimize API lifecycle and traffic management using Gravitee API Gateway
  • Design and implement durable, fault-tolerant workflows and microservice orchestration patterns using Temporal
  • Administer and tune PostgreSQL databases for reliability, performance, and high availability
  • Partner with application and platform teams to improve service operability, deployment safety, and change management

Performance Testing & Load Engineering

  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Build and maintain performance test scripts and virtual user scenarios using Micro Focus LoadRunner and VuGen (Virtual User Generator)
  • Analyze performance test results to identify bottlenecks, regressions, and scalability limits; produce clear reports with actionable recommendations
  • Integrate performance testing into CI/CD pipelines to enable continuous performance validation and shift-left testing practices
  • Establish and monitor performance baselines, benchmarks, and SLAs across critical service endpoints and user journeys
  • Collaborate with development and architecture teams to resolve performance issues and optimize system throughput, latency, and resource utilization

AI / Agentic Engineering (Azure Focus)

  • Design and implement AI-driven and agentic systems to enhance operational workflows and intelligent decision-making
  • Build intelligent automation for operational use cases, including:
    • Incident triage, enrichment, and automated escalation
    • Alert correlation, deduplication, and noise reduction
    • Automated diagnosis and remediation of recurring failures
  • Leverage Azure AI services (Azure OpenAI, Cognitive Services, Azure ML) for operational intelligence and predictive insights
  • Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms
  • Ensure safe, reliable, and observable operation of AI-powered systems in production, including guardrails, fallback mechanisms, and audit trails

Collaboration & Technical Leadership

  • Act as a reliability, performance, and automation champion across engineering teams
  • Mentor junior SREs and influence adoption of best practices in reliability, observability, and performance engineering
  • Contribute to evolving SRE standards, tooling, operational processes, and the knowledge base
  • Participate in architecture reviews and provide guidance on non-functional requirements (reliability, scalability, performance)

Qualifications

Core SRE Skills

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
  • Strong hands-on experience in production troubleshooting of distributed systems at scale
  • Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
  • Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
  • Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
  • Proficiency in one or more programming languages: Python, Go, Java, or equivalent
  • Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)

Observability & Monitoring

  • Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
  • Experience with alerting strategies, SLI/SLO-based monitoring, and on-call incident management

Performance Testing & Load Engineering

  • Proven experience designing and executing performance and load testing for large-scale distributed applications
  • Hands-on proficiency with Micro Focus LoadRunner and VuGen for scripting virtual user scenarios, parameterization, correlation, and result analysis
  • Strong understanding of performance testing methodologies: load testing, stress testing, endurance/soak testing, spike testing, and capacity planning
  • Ability to analyze performance metrics (throughput, response time, error rate, resource utilization) and translate findings into engineering actions
  • Experience integrating performance tests into automated CI/CD pipelines

Platform & Middleware

  • Experience with Gravitee or equivalent API gateway platforms for traffic management, rate limiting, and API lifecycle governance
  • Hands-on experience with Temporal for workflow orchestration, durable execution, and distributed task management
  • Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning

AI / Agentic Systems

  • Hands-on experience building or integrating AI-powered automation in production environments
  • Experience with agent-based systems, LLM-powered workflows, Retrieval-Augmented Generation (RAG), or intelligent assistants
  • Familiarity with Azure-based AI and ML services (Azure OpenAI, Cognitive Services, Azure ML)
  • Understanding of reliability, safety, observability, and operational challenges of AI systems in production

About ServiceLink

ServiceLink is the premier national provider of mortgage services. ServiceLink delivers valuation, title and closing, and flood services to mortgage originators; end-to-end subservicing to mortgage servicers; and default valuation, integrated default title services, vendor invoicing and claims audit services as well as auction services to mortgage servicers.

ServiceLink helps clients in the lending industry and beyond achieve their strategic goals, realize greater efficiencies, and better serve their customers by delivering best-in-class technology, services, and insight with a relentless commitment to upholding the highest standards of quality, compliance, and service.

For more information about ServiceLink, please visit https://www.servicelink.com/.

Industry: Finance & Insurance
Company Size: 1,001-5,000 employees
Headquarters: Moon Township, PA