Job Description
About the Role
We are seeking a skilled DevOps Engineer to join our infrastructure team. In this role, you will manage Kubernetes-based systems in our on-premises data center, working across bare-metal hardware, networking, database administration, and application deployments. You will take on the challenges of operating a private cloud environment: optimizing AI model performance, improving CI/CD pipelines, and supporting Enterprise Java and Python services. The position includes hands-on work with production data center racks to ensure reliability, scalability, and high performance for our AI-driven applications.
Responsibilities
• Design, deploy, and maintain Kubernetes clusters, using Helm to package, configure, and release applications.
• Build and optimize CI/CD pipelines with Git, Jenkins, and related tools.
• Administer Postgres and NoSQL databases, including tuning, scaling, backups, and security.
• Manage bare-metal hardware: provisioning, monitoring, troubleshooting, and maintaining servers and network components.
• Deploy, monitor, and support Enterprise Java and Python applications in production.
• Collaborate with AI teams to integrate and operate machine learning models at scale.
• Monitor system performance, implement automation, and respond to incidents to ensure high availability.
• Contribute to infrastructure-as-code, documentation, and best practices for on-premises cloud operations.
Required Skills & Qualifications
• Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
• 3+ years of experience in DevOps, cloud engineering, or systems administration, with a focus on Kubernetes and container orchestration.
• Strong proficiency with Helm for Kubernetes application management.
• Hands-on experience with Git and Jenkins for CI/CD.
• Expertise in Postgres administration: optimization, replication, and disaster recovery.
• Familiarity with Enterprise Java and Python for application support and scripting.
• Strong Linux system administration skills.
• Knowledge of bare-metal hardware management, data center operations, and networking (VLANs, firewalls, load balancers).
• Exposure to AI/ML workflows, including model deployment and inference optimization.
• Strong problem-solving skills in complex production environments.
• Experience operating private cloud or on-premises data centers at scale.
• Scripting skills in Bash, Python, or similar.