AGIBOT

Intern, Embodied AI Large-Model Training System Development and Optimization

AGIBOT  •  Shanghai, CN (Remote)  •  6 days ago

Job Description

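Location: Shanghai  •  Type: Internship
You will work on the following four representative areas of training-system optimization (including but not limited to these four).
Area 1: Support efficient and stable large-scale pre-training and fine-tuning
Key tasks:
1. Contribute to framework optimization for distributed training clusters at the thousand-GPU scale, ensuring the stability of training jobs on large clusters (job failure rate < xxx%) and their recoverability (checkpoint-resume time < xxx minutes)
2. Optimize training throughput, improving it by at least 20% over the baseline
3. Contribute to the framework-level implementation or in-depth optimization of at least one parallelism strategy (data parallelism, model parallelism, pipeline parallelism, or MoE parallelism)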
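Area 2: Reduce GPU memory usage in large-model training to support larger models
Key tasks:
1. Integrate or optimize at least one memory-saving technique (ZeRO-1/2/3, recomputation (Activation Checkpointing), or mixed-precision training) so that the number of trainable parameters on the same GPUs increases by more than 30%
2. Contribute to the integration and adaptation of high-performance operators such as FlashAttention and Flash-FFN in the distributed training framework
3. Validate and compare the memory efficiency and compute efficiency of different combinations of parallelism strategies (for example, FSDP + tensor parallelism)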
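Area 3: Optimize cross-node communication efficiency and reduce communication overhead
Key tasks:
1. Use NCCL or CANN ACL profiling tools to analyze communication bottlenecks (AllReduce, AllGather, and so on) and propose at least 2 effective optimizations
2. Contribute to overlapping communication with computation, raising the share of hidden communication to more than 50%
3. Explore and validate the feasibility and effect of low-bit communication (for example, FP8 gradient communication) in training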
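Area 4: Improve the usability and observability of the training framework
Key tasks:
1. Develop or improve a training monitoring dashboard covering the key metrics (GPU utilization, memory usage, communication time, throughput, and loss curves)
2. Contribute to automatic fault tolerance and recovery for training jobs, supporting automatic rescheduling after node-level or process-level failures
3. Write internal technical documentation and best-practice guides that help the algorithm team use the training framework more efficiently

Requirements

1. Familiarity with computer architecture, including CPU/GPU/NPU, the memory hierarchy, and hardware interconnects such as PCIe and NVLink;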
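2. Proficiency in C/C++ and Python, with solid engineering skills;
3. Familiarity with the PyTorch distributed training ecosystem (DDP, FSDP, DTensor, torch.compile, and so on);
4. Understanding of typical large-model architectures (Transformer, Attention, MoE, MHA/GQA/MQA, and so on);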
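Distributed systems track (meet one or more of the following)
1. Hands-on experience with distributed training and familiarity with communication libraries such as NCCL, MPI, and GLOO;
2. Knowledge of, or experience with, large-scale training frameworks such as Megatron-LM, DeepSpeed, FairScale, and MindSpeed;
3. Familiarity with collective communication primitives (AllReduce, AllGather, ReduceScatter, All2All, and so on) and an understanding of how they are used in model parallelism;
4. Experience with cluster scheduling systems (Slurm, Kubernetes, Ray) is preferred;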
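Performance optimization track (a plus)
1. Development experience with CUDA or CANN and familiarity with the GPU/NPU programming model;
2. Familiarity with training performance profiling tools (Nsys, Nsight Compute, PyTorch Profiler, Tracing);
3. Understanding of the implementation principles of high-performance operators such as FlashAttention, AllGatherMatMul, MatMulReduceScatter, and MatMulAllReduce.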

About AGIBOT

AgiBot builds world-leading general-purpose embodied robots and their application ecosystem by pioneering the fusion of AI and robotics. AgiBot was founded in February 2023 by seasoned industry experts, including core executives from global technology leaders and top AI scientists. During its development, AgiBot has received ardent support and guidance from senior Chinese leaders. It has been invited on multiple occasions to serve as an industry representative and to brief on the progress of the embodied intelligence sector.

Leveraging its industry-leading "1 Ontology + 3 Intelligence Interaction" architecture, built on a robotic embodiment that integrates manipulation, interaction, and locomotion intelligence, AgiBot has launched three robot series (AgiBot A2, Genie, and AgiBot X2) and the industry's first universal embodied foundation model, "Genie Operator-1 (GO-1)". This makes AgiBot the only company in the sector with a full product portfolio and comprehensive scenario coverage. AgiBot has also established a leading full-stack ecosystem to empower partners and enable transformation across a wide range of industries.

With its cutting-edge product technologies and ecosystem, AgiBot is one of the world's first companies to accomplish large-scale production and commercial deployment of embodied robots, with its products now available in multiple countries and regions. In January 2025, AgiBot mass-produced its 1,000th general-purpose embodied robot, setting a new industry milestone.

Industry
Manufacturing & Production
Company Size
51-200 employees
Headquarters
Shanghai, CN
Year Founded
Unknown