ByteDance

Research Engineer – AI Training Systems Reliability & Performance (Seed Infra)

ByteDance  •  $233k - $428k/yr  •  Seattle, WA (Onsite)  •  1 day ago

Job Description

About the Team
The Seed Infrastructures team oversees the distributed training, reinforcement learning framework, high-performance inference, and heterogeneous hardware compilation technologies for AI foundation models.

Responsibilities
- Ensure the training platform operates reliably and efficiently across pre-training, fine-tuning, evaluation, and inference workloads for large models
- Build and maintain system observability, fault detection, and troubleshooting tools, enabling AI Ops-driven proactive monitoring of distributed ML workloads
- Maintain the stability, elasticity, and performance of framework and infrastructure components across multi-tenant, multi-cloud, and heterogeneous GPU environments
- Manage cluster governance, optimize resource utilization, and improve operational efficiency and reliability of ML services
- Develop software tools, dashboards, and automation to monitor, manage, and diagnose ML training infrastructure effectively
- Participate in global team rotations for system monitoring, on-call support, and incident response

The base salary range for this position in the selected city is $232,560 - $427,500 annually.

About ByteDance

ByteDance is a global incubator of platforms at the cutting edge of commerce, content, entertainment and enterprise services. Over 2.5bn people interact with ByteDance products, including TikTok.

Creation is the core of ByteDance's purpose. Our products are built to help imaginations thrive. This is doubly true of the teams that make our innovations possible.

Together, we inspire creativity and enrich life, a mission we work toward every day. At ByteDance, we create together and grow together. That's how we drive impact for ourselves, our company, and the users we serve. We are committed to building a safe, healthy and positive online environment for all our users.

We have over 110,000 employees based in more than 30 countries globally. Join us.

Industry: IT & Software
Company Size: 10,000+ employees
Headquarters: China, CN
Year Founded: Unknown