TikTok

Incident Response Manager - Data Center

TikTok  •  San Jose, CA (Onsite)  •  2 months ago

Job Description

About the team

The Data Systems Infrastructure (DSI) team sits within the global technology structure and supports the company's fast growth by building and operating hyper-scale data centers, managing the life cycle of the server fleet, providing cloud solutions, and developing infrastructure services that are scalable and reliable.

Job Description

We are seeking a technically skilled and detail-oriented professional to serve as a front-line responder for incident detection, triage, and response across infrastructure, facilities, and security operations. The ideal candidate will possess a solid foundation in IT, infrastructure, or engineering disciplines, with experience in critical environments and the ability to analyze incidents, identify patterns, and drive long-term improvements. This role requires composed performance, data-driven thinking, and a proactive approach to continuous improvement and operational resilience.

Responsibilities

- Serve as the first responder in the IRC Operation Center, detecting and responding to events across infrastructure and facilities using tools such as Server Automation, Data Center Infrastructure Management, network monitoring, Grafana, and related systems.

- Respond promptly to events including but not limited to:

  - Environmental systems (e.g., high temperature, humidity, power fluctuations or failures)

  - IT infrastructure (e.g., server performance issues, network outages, system failures)

  - Facility and environmental alerts relevant to operations

  - External-facing services (e.g., colocation maintenance notices, service requests from CDN partners, and critical notifications)

- Conduct detailed investigations to diagnose the root cause of events, assess their impact, and determine appropriate response actions.

- Monitor and analyze detected events, accurately classify incidents based on potential or actual customer impact, and proactively communicate risks. Coordinate timely escalations by notifying and collaborating with relevant support teams to ensure swift incident resolution.

- Monitor incident response performance against agreed SLAs, ensuring timely alerts and notifications.

- Manage incidents calmly and efficiently, performing in-depth investigations to determine root causes and impacts, while promptly engaging and coordinating with the designated resolver teams to facilitate timely resolution.

- Draft detailed incident reports and conduct post-mortem reviews to document lessons learned.

- Generate regular reports to deliver comprehensive insights into the effectiveness of incident response and recovery processes.

- Analyze trends and patterns in events to identify opportunities for improvement and optimization.

- Own and drive the Incident, Problem, and Change Management processes in alignment with ITIL or internal ITSM frameworks.

- Develop and maintain a comprehensive library of Standard Operating Procedures (SOPs), Methods of Procedure (MOPs), runbooks, and operational guides to ensure consistency and readiness across teams.

- Lead or support continuous improvement projects aimed at enhancing incident response capabilities, operational security, system reliability, and overall infrastructure performance. Collaborate with cross-functional teams to implement engineering solutions and process optimizations.

- Provide technical and operational leadership to the incident response center team, ensuring consistent performance and adherence to best practices.

About TikTok

Inspire Creativity and Bring Joy

Industry: Arts & Entertainment
Company Size: 10,000+ employees
Headquarters: Los Angeles, California
Year Founded: Unknown