Job Description
We are looking for a Data Platform Reliability Engineer to ensure the reliability, scalability, and performance of our enterprise data platform. This role blends Site Reliability Engineering (SRE) and DevOps practices to support modern cloud-based data ecosystems built on Google BigQuery, lakehouse architectures, and distributed data pipelines.
You will help build resilient, well-governed, observable, and cost-efficient data platforms, and enable engineering teams to operate at scale through automation and sound engineering practices.
Key Responsibilities
- Automate infrastructure provisioning and operations using Infrastructure as Code (IaC)
- Implement and manage CI/CD pipelines for data and platform deployments
- Implement and manage data governance tools such as Collibra
- Improve system resilience through capacity planning, performance tuning, and fault tolerance design
- Optimize cloud usage and costs through FinOps best practices
- Collaborate with engineering and analytics teams to improve platform reliability and developer experience
- Drive security, compliance, and access control best practices
- Own platform reliability and availability for enterprise data systems, including defining and tracking SLAs, SLOs, and error budgets
- Monitor and manage data cloud infrastructure, ingestion frameworks, and transformation workflows
- Lead incident management, root cause analysis (RCA), and postmortems
Required Skills & Experience
- 3–5 years of experience in SRE, DevOps, or platform engineering roles
- Strong experience with cloud platforms (GCP preferred; AWS or Azure is a plus)
- Experience with CI/CD tools (GitHub Actions, Jenkins, etc.)
- Proficiency in Python, Bash, and SQL
- Experience with Infrastructure as Code tools (Terraform preferred)
- Strong understanding of monitoring and observability tooling across logs, metrics, and traces
- Experience managing production incidents and on-call rotations
- Exposure to Kubernetes / containerization
- Understanding of data governance and security practices
- Exposure to AI/ML or GenAI tools for automation and operational efficiency
Preferred Skills
- Experience with data observability tools/frameworks
- Knowledge of Collibra or similar data governance tools
- Knowledge of BI tools such as Looker, Tableau, or Power BI
- Understanding of cost optimization (FinOps) in cloud data platforms
What You Will Bring
- Strong ownership mindset and bias for automation
- Ability to troubleshoot complex distributed systems under pressure
- Passion for improving system reliability and performance at scale
- Excellent collaboration and communication skills
- Willingness to participate in a 24/7 support/on-call rotation
Why Join Us