NVIDIA

Manager, Software Engineering - NCCL

NVIDIA  •  Shanghai, CN (Onsite)  •  19 days ago

Job Description

We are the GPU Communications Libraries and Networking team at NVIDIA. We deliver communication libraries such as NCCL and NVSHMEM for Deep Learning (DL) and HPC. DL and HPC applications already have huge compute demands and run at scales of up to tens of thousands of GPUs. The GPUs are connected by high-speed interconnects (e.g., NVLink, PCIe) within a node and by high-speed networking (e.g., InfiniBand, Ethernet) across nodes. Communication performance between the GPUs has a direct impact on end-to-end application performance, and the stakes are even higher at huge scales! We are looking for a dynamic, technical leader for our China NCCL team. This is an outstanding opportunity to push the limits of the state of the art and deliver platforms the world has never seen before. Are you ready to contribute to the development of innovative technologies and help realize NVIDIA's vision?

What you will be doing:

  • Lead, mentor, and grow our China engineering team. Own the end-to-end execution spanning planning, prioritization, quality control and performance.

  • Interact with customers and researchers to understand their use cases and requirements. Collaborate with engineering, program and product management, and partners to define the product roadmap.

  • Contribute to feature design and implementation.

  • Continuously review and identify improvement opportunities in established processes, infrastructure, and practices to ensure the teams are accomplishing work in the most efficient and transparent manner.

What we need to see:

  • 10+ years of overall experience in the software industry, with 4+ years of management experience.

  • Bachelor's, Master's, or Ph.D. in CS, CE, EE, or a related technical field, or equivalent experience.

  • Specialization in systems software, communication runtimes, or high performance networking. Proven success in managing several complex initiatives or products through the full product life cycle.

  • Strong understanding of computer systems architecture, networking technologies (RDMA, RoCE, Ethernet, EFA, InfiniBand) and topologies, operating system principles (systems software fundamentals), HW-SW interactions, and performance analysis and optimization.

  • Hands-on C/C++ programming and debugging skills in Linux.

  • Experience balancing multiple projects with competing priorities. Flexibility to work and communicate effectively across different teams and time zones.

Ways to stand out from the crowd:

  • Active user or developer of NCCL!

  • Customer engagement experience in this space.

  • Experience with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, NIXL, OpenSHMEM, UCX, UCC).

  • Experience with programming using CUDA, MPI, OpenMP, OpenACC, pthreads.

  • Knowledge of HPC and ML/DL fundamentals. Experience with deep learning frameworks such as PyTorch, TensorFlow, vLLM, SGLang, and TRT-LLM.

About NVIDIA

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.

Industry: Hardware & Semiconductors
Company Size: 10,000+ employees
Headquarters: Santa Clara, CA
Year Founded: 1993