Job Description

Meet the Team

You will play a pivotal role on the team responsible for designing and developing the next generation of scalable Kubernetes infrastructure and machine learning platforms that support both traditional ML and state-of-the-art Large Language Models (LLMs). This is a position for expert engineers: you will lead the technical direction, ensuring the performance, reliability, and scalability of AI systems while collaborating closely with data scientists, researchers, and other engineering teams.

Your Impact

The ideal candidate will have strong hands-on expertise in Red Hat OpenShift, proficiency in Golang and/or Python, and a passion for delivering highly reliable, scalable, and secure infrastructure. Hands-on experience with AI technologies such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU frameworks is also expected.

Core Responsibilities

Design, deploy, administer, and optimize highly available Red Hat OpenShift platforms.
Implement and drive Site Reliability Engineering (SRE) practices to ensure platform reliability, scalability, and operational excellence.
Develop automation tools, operators, and platform services using Golang and/or Python.
Manage cluster lifecycle activities including upgrades, patching, capacity planning, and performance tuning.
Build and maintain CI/CD pipelines and Infrastructure as Code (IaC) solutions.
Implement and maintain observability solutions including logging, metrics, tracing, and alerting.
Monitor platform health and proactively identify and resolve reliability and performance issues.
Solve production incidents, perform root cause analysis (RCA), and drive preventive actions.
Collaborate closely with application and DevOps teams to improve deployment processes and platform adoption.
Ensure platform security, compliance, and adherence to organizational standards and procedures.
Participate in a 16×5 on-call support rotation, providing timely response and resolution for production incidents and ensuring service availability.
Continuously evaluate and adopt emerging technologies to enhance platform capabilities and operational efficiency.
Collaborate with global cross-functional teams across regions to support platform initiatives, drive operational excellence, and ensure seamless delivery of services and solutions.
Contribute to the GPU-as-a-Service platform offering and provide client support for hosting GPU-powered AI/ML workloads.

Minimum Qualifications / Requirements

4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related roles.
Strong hands-on experience with Red Hat OpenShift administration, operations, and troubleshooting.
Proficiency in Golang and/or Python for automation and platform engineering.
Experience with container technologies such as Docker and container runtimes.
Strong understanding of Linux systems, networking, and distributed systems concepts.
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or equivalent.
Experience with CI/CD tools such as Jenkins, GitLab CI, ArgoCD, Tekton, or similar.
Proven experience with observability tools such as Prometheus, Grafana, ELK, Loki, Jaeger, and OpenTelemetry.
Strong troubleshooting, debugging, and incident management capabilities.
Hands-on experience with AI/ML platforms, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU architectures.
Experience with AI frameworks such as LangChain or LlamaIndex, or with vector databases.
Ability to support and participate in 16×5 on-call rotations for critical production environments.

Preferred Qualifications / Requirements

Familiarity with public cloud platforms (AWS, Azure, or GCP).
Familiarity with GitOps methodologies and tools.
Experience with service mesh technologies such as Istio.
Knowledge of container and platform security standards.
Reliability-first and automation-driven attitude.
Strong analytical and problem-solving skills.
Ability to work effectively in a fast-paced production environment.
Excellent communication and partnership skills.
Ownership, accountability, and a customer-focused approach.

Why Cisco?

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.

We are Cisco, and our power starts with you.

About Cisco

Cisco is the worldwide technology leader that is revolutionizing the way organizations connect and protect in the AI era. For more than 40 years, Cisco has securely connected the world. With its industry-leading AI-powered solutions and services, Cisco enables its customers, partners, and communities to unlock innovation, enhance productivity, and strengthen digital resilience. With purpose at its core, Cisco remains committed to creating a more connected and inclusive future for all.

Industry

IT & Software

Company Size

10,000+ employees

Headquarters

San Jose, CA

Year Founded

1984

Website

cisco.com

Platform Engineer – OpenShift + AI/ML SRE | 4+ years
