XPENG

Large Model Training Acceleration Engineer / Senior Expert

XPENG  •  Onsite  •  3 months ago

Job Description

Beijing  •  Full-time  •  General Intelligence Division

Responsibilities
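Training acceleration and optimization: Profile large model training workloads and optimize the full pipeline, including GPU memory management, compute acceleration, and communication optimization (NCCL), to increase cluster training throughput.
Kernel development and co-design: Develop and tune high-performance kernels; work closely with the algorithm team to design custom kernels for specific model architectures (e.g., Transformer, MoE).
Distributed framework development: Extend and optimize frameworks such as Megatron-LM, DeepSpeed, and FSDP, and design parallel training schemes suited to large-scale clusters.
Stability: Diagnose and resolve issues that arise during large-scale training, including but not limited to NCCL timeouts, out-of-memory (OOM) errors, and training speed fluctuations, to keep training jobs running efficiently and stably.

Requirements
Theoretical foundations: Solid computer science fundamentals; a deep understanding of deep learning training principles (computation graphs, automatic differentiation, mixed precision); familiarity with mainstream parallelism strategies and acceleration algorithms such as FlashAttention.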
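Programming: Proficient in Python/C++ and familiar with the GPU programming model; experience developing kernels in CUDA / Triton / TileLang is preferred.
Framework experience: In-depth understanding of at least one mainstream training framework (PyTorch, TensorFlow, Megatron, DeepSpeed), with the ability to modify it at the source-code level.
Domain experience: Familiar with the Transformer architecture; experience optimizing training of frontier models such as VLMs (vision-language models), Stable Diffusion, or MoE is preferred.
Tools and debugging: Proficient in performance analysis with Nsight Systems/Compute; familiar with how the NCCL communication library works and with troubleshooting common failures.
Learning ability: Strong problem-analysis skills and a habit of tracking and reproducing cutting-edge industry papers (arXiv).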

About XPENG

XPeng is a leading Chinese Smart EV company that designs, develops, manufactures, and markets Smart EVs that appeal to the large and growing base of technology-savvy middle-class consumers. Its mission is to drive Smart EV transformation with technology and data, shaping the mobility experience of the future. To optimize its customers’ mobility experience, XPeng develops its full-stack advanced driver-assistance system technology and in-car intelligent operating system in-house, as well as core vehicle systems including the powertrain and the electrical/electronic architecture. XPeng is headquartered in Guangzhou, China. In 2021, the Company established its European headquarters in Amsterdam, along with other dedicated offices in Copenhagen, Munich, Oslo, and Stockholm. The Company’s Smart EVs are mainly manufactured at its plants in Zhaoqing and Guangzhou, Guangdong province.

For more information, please visit https://heyxpeng.com.

Industry
Automotive & Mobility
Company Size
1,001-5,000 employees
Headquarters
Guangzhou, CN
Year Founded
2014