Job Description
About the Team
We are dedicated to building the training infrastructure for ultra-large-scale language models, vision-language models, and frontier agentic models. Our mission is to provide a robust, scalable, and high-performance foundation for post-training, multimodal learning, and reinforcement learning at the hundred-billion-parameter scale and beyond. You will work on some of the most challenging problems in large-model training systems, from multimodal data efficiency to convergence optimization for next-generation foundation models.
What You'll Do
- Build and evolve unified training infrastructure for large models across post-training workflows, modalities, and training paradigms
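- Design and optimize distributed training strategies for 100B- to 1T-parameter models, including data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and expert parallelism (EP), along with operator fusion, memory optimization, and cluster-level Model FLOPs Utilization (MFU) improvement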
- Develop training and evaluation systems for Reasoning RL and Agent RL, including benchmarks, harnesses, convergence optimization, and rollout efficiency
- Enable multimodal training across image, text, audio, and video, and support emerging architectures such as Mixture-of-Experts (MoE) and linear attention, with correctness and convergence validation