Job Description

Team Introduction

The Site Reliability Engineering (SRE) team at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems. Our team is dedicated to ensuring that TikTok’s core services remain stable, efficient, and resilient at a global scale. We focus on enhancing the observability and operability of our infrastructure, using data-driven insights to safeguard business stability 24/7.

Responsibilities

As a Site Reliability Engineer, you will be responsible for the end-to-end reliability of our production ecosystem. You will balance traditional SRE functions—such as automation and performance tuning—with a specialized focus on disaster recovery and rapid incident response.

- System Design & Optimization: Participate in the full lifecycle of high-concurrency distributed systems. Collaborate with development teams to ensure services are designed for scalability, reliability, and high availability.

- Automation & Efficiency: Build and maintain robust automation tools to eliminate "toil," streamline service deployments, and manage infrastructure as code.

- Observability & Monitoring: Develop and refine monitoring, alerting, and logging systems, and define the SLIs/SLOs they track, to provide deep visibility into service health and performance.

- Disaster Recovery (DR) & Resilience: Lead the design, implementation, and execution of global disaster recovery drills. You will simulate complex failure scenarios and validate failover mechanisms to ensure the platform remains operational under extreme conditions.

- Incident Management & Response: Serve as a key responder for high-priority production incidents. You will coordinate cross-functional "war rooms," drive technical troubleshooting, and lead the path to service restoration.

- Continuous Improvement: Facilitate blameless post-mortems and perform root-cause analysis (RCA). You will transform incident insights into engineering requirements to harden our systems against future outages.

- Capacity Planning: Manage resource allocation and resolve performance bottlenecks to ensure the platform can handle organic growth and massive traffic surges.

About TikTok

Inspire Creativity and Bring Joy

Industry

Arts & Entertainment

Company Size

10,000+ employees

Headquarters

Los Angeles, California

Website

tiktok.com

Senior Site Reliability Engineer, Reliability Team - USDS
