qode.world

Principal Cloud and Production Operations Engineer

qode.world  •  California (Hybrid)  •  2 days ago

Job Description

The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications.

This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments.

Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.

Key Responsibilities:

Cloud Architecture and Engineering

• Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines

• Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments

• Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency

• Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals

Production Operations and Reliability

• Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems

• Develop and maintain observability frameworks leveraging metrics, logs, and traces to ensure proactive detection and rapid response

• Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews

• Drive root cause analysis, performance tuning, and continuous improvement of production services

Automation and CI/CD Enablement

• Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases

• Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards

• Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk

• Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet

Operational Governance and Collaboration

• Establish and enforce operational best practices for monitoring, patching, and change management across production systems

• Lead production readiness reviews for new releases and large-scale changes

• Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements

• Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution

Leadership and Mentorship

• Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence

• Lead architectural reviews, design sessions, and capacity planning discussions

• Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies

Qualifications:

• Bachelor’s degree in Computer Science, Information Systems, or a related field; Master’s preferred

• 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role

• Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management

• Proven experience managing production-scale environments supporting mission-critical applications and services

• Strong proficiency in:

  - Infrastructure-as-code (Terraform, CloudFormation)

  - CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)

  - Container orchestration (Kubernetes, Docker)

  - Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)

  - Scripting and automation (Python, Bash, PowerShell)

• Solid understanding of security, compliance, and networking principles in hybrid environments

• Exceptional analytical, problem-solving, and incident management skills

• Demonstrated ability to lead complex, cross-functional initiatives from concept to execution

Preferred Experience:

• Experience in high-availability SaaS or networking environments

• Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks

• Familiarity with Zero Trust, identity federation, and cloud access security models

• Exposure to AI/ML infrastructure or data-driven pipelines is a plus

About qode.world

We revolutionize how talent finds meaningful careers by harnessing the power of data and automation. Our platform utilizes LLMs to parse resumes and reconstruct queries, transforming unstructured data into actionable insights. This enables us to build robust data moats, such as creating 'Private Talent Pools' for recruiters where autonomous agents enrich candidate profiles.

By automating high-volume recruiting workflows, we reduce the marginal cost of work to zero. Agents match profiles to job descriptions, find contact information, send personalized messages, and schedule interviews automatically, significantly decreasing the time to close. Additionally, we transcribe interviews and make the data searchable, making hiring decisions more objective.

We drive confidence by raising the quality bar for job seekers. We automate technical exercises such as coding tests and evaluate candidates on merit, providing recruiters with pass/fail scores and qualitative feedback.

We also provide Exclusive or Retained Recruitment services, offering specialized recruitment with no upfront cost or a retained model with a partial fee, ensuring exclusivity and dedicated support throughout the hiring process.

Our Fractional Head of People and HR Advisory services offer flexible, strategic support through part-time or interim roles, as well as comprehensive advisory services to guide crucial HR decision-making.

Lastly, our HR Due Diligence process provides thorough insights into the HR frameworks of target companies, helping mitigate risks across the board.

How do you envision the future of recruiting with the integration of such advanced technologies?

Industry: IT & Software
Company Size: 51-200 employees
Headquarters: Singapore, SG
Year Founded: 2023