Hyphen Connect

LLM Pre-training & Distributed Systems Engineer (AI Infrastructure)

Hyphen Connect  •  Commonwealth of Australia (Onsite)  •  1 month ago

Job Description

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to orchestrate large-scale machine learning training runs and optimize distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience, and will keep training runs efficient and reliable.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize networking (InfiniBand/RDMA), and tune memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).

About Hyphen Connect

Hyphen Connect: The Nexus of Web3 Talents

As your premier Web3 talent acquisition partner, Hyphen Connect is dedicated to driving innovation by connecting passionate talent with forward-thinking enterprises. We equip both with the essential knowledge and tools needed to excel in the rapidly evolving, decentralized landscape.

We serve as the link to top Web3 opportunities across infrastructure, DeFi, NFTs, gaming, and more, providing unparalleled insights, data-driven research, and comprehensive resources.

Join us and become an integral part of our thriving Web3 community. Let's connect!

Industry: HR & Recruiting
Company Size: 1-10 employees
Headquarters: Unknown
Year Founded: 2024