Job Description
Team Introduction
The Site Reliability Engineering (SRE) team at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. Our team is dedicated to ensuring that TikTok’s core services remain stable, efficient, and resilient at a global scale. We focus on enhancing the observability and operability of our infrastructure, using data-driven insights to safeguard business stability 24/7.
Responsibilities
As a Site Reliability Engineer, you will be responsible for the end-to-end reliability of our production ecosystem. You will balance traditional SRE functions, such as automation and performance tuning, with a specialized focus on disaster recovery and rapid incident response.
- System Design & Optimization: Participate in the full lifecycle of high-concurrency distributed systems. Collaborate with development teams to ensure services are designed for scalability, reliability, and high availability.
- Automation & Efficiency: Build and maintain robust automation tools to eliminate "toil," streamline service deployments, and manage infrastructure as code.
- Observability & Monitoring: Develop and refine monitoring, alerting, and logging systems, and define the SLIs/SLOs they track, to provide deep visibility into service health and performance.
- Disaster Recovery (DR) & Resilience: Lead the design, implementation, and execution of global disaster recovery drills. You will simulate complex failure scenarios and validate failover mechanisms to ensure the platform remains operational under extreme conditions.
- Incident Management & Response: Serve as a key responder for high-priority production incidents. You will coordinate cross-functional "war rooms," drive technical troubleshooting, and lead service restoration.
- Continuous Improvement: Facilitate blameless post-mortems and perform root-cause analysis (RCA). You will transform incident insights into engineering requirements to harden our systems against future outages.
- Capacity Planning: Manage resource allocation and resolve performance bottlenecks to ensure the platform can handle both organic growth and massive traffic surges.