42dot

AI Infrastructure Engineer

42dot  •  Republic of Korea (Onsite)  •  2 months ago

Job Description

We are looking for the best

At 42dot, our AI Infrastructure Engineer manages the high-performance AI infrastructure orchestrating thousands of GPUs across multiple data centers. You will contribute to the scaling, monitoring, and operational optimization required to maintain a robust and world-class computing environment.

Responsibilities

  • Operate and maintain a large-scale GPU cluster consisting of thousands of GPUs across multiple data centers using Kubernetes and Slurm.

  • Monitor and diagnose failures across the GPU hardware and software stacks to ensure high availability and rapid recovery.

  • Develop automation tools and scripts using Python or Shell to streamline repetitive infrastructure management tasks and improve operational efficiency.

  • Manage GPU resource quotas and provide technical support to ML researchers to ensure optimal utilization of computing resources.

  • Participate in the architectural design and performance tuning of distributed training environments for large-scale autonomous driving models.

Qualifications

  • Strong proficiency in Linux operating systems, including a solid understanding of kernel operations, process management, and system security.

  • Practical experience with containerization technologies (Docker) and orchestration (Kubernetes), including building, managing, and troubleshooting containerized environments.

  • Solid understanding of networking fundamentals, including TCP/IP and HTTP(S), with the ability to perform basic network troubleshooting.

  • Ability to write clean and maintainable scripts in Python or Shell for automation and system administration.

  • Logical approach to problem-solving with the persistence to identify and resolve root causes in complex, large-scale systems.

  • Strong communication skills to effectively collaborate with cross-functional teams and external partners.

Preferred Qualifications

  • Experience in building observability stacks with Prometheus, Grafana, and Datadog for large-scale clusters.

  • Experience in building or operating infrastructure on public cloud platforms such as AWS or GCP.

  • Knowledge of the NVIDIA accelerated computing stack, including drivers, CUDA, and NCCL.

  • Familiarity with the ML model training lifecycle and deep learning frameworks such as PyTorch or TensorFlow.

  • Experience with large-scale workload managers or resource scheduling tools such as Kubernetes or Slurm.

  • Familiarity with Infrastructure as Code (IaC) tools such as Terraform to manage complex infrastructure.

Interview Process

  • Document screening - Online coding test - Video interview (about 1 hour) - In-person interview (about 3 hours) - Final offer

  • The hiring process may differ by role and may change depending on schedule and circumstances.

  • Schedules and results will be sent individually to the email address registered in your application.

Additional Information

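  • When submitting your resume, please exclude information that employers are prohibited from requesting under the Fair Hiring Procedure Act, such as resident registration number, family relations, marital status, salary, photo, physical attributes, and place of origin.

  • Please upload all files as PDFs of 30MB or less. (If you run into a problem uploading your resume, please send it to recruit@42dot.ai along with the URL of the position you are applying for.)

  • After the interview process, a reference check may be conducted with the applicant's consent.

  • Persons of national merit and persons eligible for employment protection receive preferential treatment in accordance with applicable laws.

  • Holders of a disability registration certificate receive preferential treatment under the Act on the Employment Promotion and Vocational Rehabilitation of Persons with Disabilities.

  • 42dot does not accept resumes from search firms it has not engaged and does not pay fees for unsolicited resumes.

※ Please be sure to review the following before applying.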

About 42dot

We envision a world where everything is connected and moves autonomously through a self-managing urban transportation operating system. We are spearheading the transition to SDV (software-defined vehicle) with software and AI.

We are developing diverse SDV technologies that continuously provide vehicle updates and services based on data for user-centric and safe mobility.

If you want to create a new future of mobility through software, AI, and automotive vehicles, check out our job openings. Come ride with us!

Industry
IT & Software
Company Size
501-1,000 employees
Headquarters
Seongnam-si, KR
Year Founded
2019
Website
42dot.ai