Job Description
About the Team
We are dedicated to building the inference infrastructure for ultra-large-scale language models, vision-language models, and frontier multimodal AI systems. Our mission is to provide a robust, scalable, and high-performance foundation for distributed serving, heterogeneous scheduling, and low-latency inference at massive scale. You will work on some of the most challenging problems in large-model online serving, spanning traffic orchestration, throughput and latency optimization, kernel efficiency, and production reliability for next-generation AI systems.
Responsibilities - What You'll Do
- Build and evolve next-generation inference systems for large-scale online traffic, including global scheduling across heterogeneous compute resources, high-concurrency load balancing, and efficient batch formation
- Optimize distributed inference for 200B+ parameter models and complex multimodal models through tensor parallelism (TP), expert parallelism (EP), data parallelism (DP), and related strategies to improve throughput and latency in production
- Develop high-performance kernels for frontier model architectures such as MoE, emerging attention mechanisms, and multimodal fusion layers using CUDA, Triton, and related tools
- Explore AI-driven infrastructure for inference systems, including AI agents for kernel optimization, performance tuning, consistency validation, deployment pipelines, and intelligent operations