At 42dot, the AI Infrastructure Engineer manages the high-performance AI infrastructure that orchestrates thousands of GPUs across multiple data centers. You will contribute to the scaling, monitoring, and operational optimization required to maintain a robust, world-class computing environment.
Responsibilities
Operate and maintain a large-scale GPU cluster consisting of thousands of GPUs across multiple data centers using Kubernetes and Slurm.
Monitor and diagnose failures across the GPU hardware and software stacks to ensure high availability and rapid recovery.
Develop automation tools and scripts using Python or Shell to streamline repetitive infrastructure management tasks and improve operational efficiency.
Manage GPU resource quotas and provide technical support to ML researchers to ensure optimal utilization of computing resources.
Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.
Qualifications
Strong proficiency in Linux operating systems, including a solid understanding of kernel operations, process management, and system security.
Practical experience with containerization technologies (Docker) and orchestration (Kubernetes), including building, managing, and troubleshooting containerized environments.
Solid understanding of networking fundamentals, including TCP/IP and HTTP(S), with the ability to perform basic network troubleshooting.
Ability to write clean and maintainable scripts in Python or Shell for automation and system administration.
Logical approach to problem-solving with the persistence to identify and resolve root causes in complex, large-scale systems.
Strong communication skills to effectively collaborate with cross-functional teams and external partners.
Preferred Qualifications
Experience in building observability stacks with Prometheus, Grafana, and Datadog for large-scale clusters.
Experience in building or operating infrastructure on public cloud platforms such as AWS or GCP.
Knowledge of the NVIDIA accelerated computing stack, including drivers, CUDA, and NCCL.
Familiarity with the ML model training lifecycle and deep learning frameworks such as PyTorch or TensorFlow.
Experience with large-scale workload managers or resource scheduling tools such as Kubernetes or Slurm.
Familiarity with Infrastructure as Code (IaC) tools such as Terraform to manage complex infrastructure.
Interview Process
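Document screening - Online coding test - Video interview (about 1 hour) - In-person interview (about 3 hours) - Final acceptance
The hiring process may be run differently depending on the role and may change depending on schedule and circumstances.
The schedule and results of each stage will be sent individually to the email address registered in your application.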
Additional Information
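When submitting your resume, please exclude information that the Fair Hiring Procedure Act prohibits employers from requesting, such as resident registration number, family relations, marital status, salary, photo, physical condition, and place of origin.
Please upload all submitted files in PDF format, 30MB or smaller. (If a problem occurs while uploading your resume, please send the resume to recruit@42dot.ai together with the URL of the position you wish to apply for.)
After the interview process ends, a reference check may be conducted with the applicant's consent.
Persons of national merit and persons eligible for employment protection are given preference in accordance with the relevant laws.
Holders of a disability registration card are given preference in accordance with the Act on the Employment Promotion and Vocational Rehabilitation of Persons with Disabilities.
42dot does not accept resumes from search firms it has not engaged and does not pay fees for unsolicited resumes.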
※ Please be sure to review the following before applying.
How 42dot works: see the 42dot Way →
42dot's own work engagement program: see the Employee Engagement Program →

We envision a world where everything is connected and moves autonomously through a self-managing urban transportation operating system. We are spearheading the transition to SDV (software-defined vehicle) with software and AI.
We are developing diverse SDV technologies that continuously provide vehicle updates and services based on data for user-centric and safe mobility.
If you want to create a new future of mobility through software, AI, and automotive vehicles, check out our job openings. Come ride with us!