Job Description
The Global E-commerce Service Architecture team ensures the availability, scalability, and resilience of TikTok’s e-commerce platform in the U.S., partnering closely with product and engineering teams to operate reliable, large-scale production systems.
We are seeking a Site Reliability Engineer (SRE) to advance the stability and resilience of TikTok Global E-commerce services in the U.S. In this role, you will strengthen disaster recovery readiness, optimize infrastructure capacity, and elevate service stability.
Key Responsibilities:
- Data Center Disaster Recovery: Keep services ready for disaster recovery during normal operations through contingency planning and drills, capacity assurance, and effective response when a disaster occurs.
- Resource Management & Capacity Planning: Manage and plan server and compute resources, including resource restructuring, overall capacity planning, and dynamic scaling, to support reliable business deployment and operations.
- Service Stability Improvement: Establish and enhance service monitoring systems to enable timely alerting on failures and rapid issue identification and resolution. Partner with business stakeholders to conduct ongoing stability governance.