King Abdullah University of Science and Technology

AI/ML Support Automation Analyst

King Abdullah University of Science and Technology  •  Kingdom of Saudi Arabia (Onsite)  •  8 days ago

Job Description

The AI/ML Support Automation Analyst will be a key member of the KSL AI Support Team, focusing on MLOps infrastructure, container orchestration, and workflow automation at supercomputing scale. Working under the AI/ML Support Team Lead, this role is responsible for developing and maintaining secure, OCI-compliant container images, robust CI/CD pipelines, and cloud-native MLOps workflows that enable researchers to efficiently deploy and manage AI/ML workloads. The Analyst will bridge the gap between cutting-edge Kubernetes-based infrastructure and the diverse needs of the research community, contributing to governance, technical enablement, and community development initiatives.

Major Responsibilities

1. MLOps and Container Development

• Provide timely and useful user support via telephone, walk-in, email, and ticketing system submissions for all types of inquiries.

• Maintain high customer service standards in dealing with and responding to user issues and questions.

• Develop and maintain secure, OCI-compliant, and HPC-ready AI/ML and data science software container images

• Design and implement robust MLOps workflows and pipelines at supercomputing scale

• Develop and maintain CI/CD pipelines for reproducible infrastructure and workflow deployment

• Design and deploy APIs for AI/ML services and inference endpoints

• Implement and manage Kubernetes-based orchestration, including CNI, CSI, and service mesh configurations and optimization

• Deploy and maintain container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry)

2. Governance and Compliance Support

• Assist in computational readiness reviews for AI research projects

• Assist in AI model and artifact control reviews to ensure compliance with institutional standards

• Provide consultation to users on efficient resource usage for AI/ML and MLOps workflows

• Ensure container images and workflows comply with security policies and best practices

• Support the implementation of usage monitoring and reporting systems

3. Performance and Benchmarking

• Perform performance debugging and tuning of MLOps and cloud-native workflows

• Develop and maintain AI/ML and MLOps workload benchmarks for procuring new systems

• Create and maintain regression testing workloads for existing clusters

• Deploy and maintain observability and resource monitoring stacks using Prometheus, Grafana, NVIDIA DCGM, and Grafana Loki

• Contribute to technology evaluation and benchmarking exercises for future infrastructure investments

4. Training and Documentation

• Create comprehensive training content for users on MLOps platforms, Kubernetes, and containerization

• Develop and maintain high-quality user documentation for automation tools and workflows

• Support the delivery of workshops on CI/CD, container orchestration, and MLOps best practices

• Contribute to knowledge transfer initiatives within the KAUST research community

• Provide one-on-one consultation to researchers on efficient use of automation infrastructure

Personal Requirements

Competencies

• Experience

• Demonstrated experience developing robust and complex MLOps pipelines

• Hands-on experience with API design and deployment

• Experience developing robust and portable CI/CD pipelines for reproducible infrastructure and workflow deployment

• Experience supporting researchers or working in academic/research computing settings preferred

• Technical Skills - Essential

• Kubernetes: Strong expertise in Kubernetes, Container Network Interface (CNI), Container Storage Interface (CSI), and Service Mesh

• MLOps: Experience developing and maintaining MLOps pipelines and workflows

• CI/CD: Proficiency in building CI/CD pipelines for infrastructure and application deployment

• Containerization: Experience building secure, OCI-compliant container images

• API Development: Experience in API design, development, and deployment

• Programming: Proficiency in Python; experience with Go, Bash scripting

• Linux: Strong Linux/Unix systems administration skills

• Technical Skills - Desired

• Experience with ArgoCD, Airflow, Dask, and Spark for workflow orchestration

• Experience with Kubeflow, KServe, and Seldon for ML serving and pipelines

• Experience deploying and maintaining observability stacks (Prometheus, Grafana, NVIDIA DCGM, Grafana Loki)

• Knowledge of Model Context Protocol (MCP) and agentic frameworks

• Experience deploying inference services at scale

• Experience deploying and maintaining container registries (Harbor) and model registries (MLflow, Kubeflow Model Registry, Artifact Hub)

• Experience with GitOps practices and Infrastructure as Code (Terraform, Ansible)

• Experience with HPC schedulers (Slurm) and HPC-cloud integration

• Soft Skills

• Strong problem-solving and analytical abilities

• Excellent written and verbal communication skills in English

• Customer service mindset with patience for supporting diverse skill levels

• Ability to work independently and as part of a collaborative team

• Strong documentation and knowledge-sharing practices

• Cultural sensitivity for working in an international environment

Preferred Qualifications

• Experience in national laboratories or major research computing facilities

• Experience with GPU scheduling and resource management in Kubernetes

• Background in DevOps or Site Reliability Engineering (SRE)

• Contributions to open-source cloud-native or MLOps projects

• Publications or presentations on MLOps, Kubernetes, or automation topics

• Knowledge of Saudi Arabia's Vision 2030 and national AI initiatives

• Additional certifications: AWS/Azure/GCP, Terraform, NVIDIA DLI

Qualifications

• Bachelor's or Master's degree in Computer Science, Data Science, Computational Science, Artificial Intelligence, or a related field

• Certifications such as CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), CKS (Certified Kubernetes Security Specialist), or CNPE (Certified Cloud Native Platform Engineer) are highly valued

Experience

• Minimum of 2 years of relevant experience
