Job Description
This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior Site Reliability Engineer (m/f/d) in Germany.
This role offers the opportunity to shape and scale the infrastructure powering a modern AI-driven platform used by frontline employees across industries worldwide. As part of a highly collaborative Platform Squad, you will take ownership of critical reliability and scalability initiatives while driving architectural decisions that directly impact system resilience and performance. You will work in high-throughput, cloud-native environments built on Kubernetes and modern observability stacks, helping engineering teams operate more efficiently and securely. The position combines hands-on technical leadership with mentoring responsibilities, making it ideal for experienced engineers who enjoy solving complex infrastructure challenges while elevating team capabilities. You will play a key role in defining platform reliability standards, improving operational excellence, and enabling global scalability in a fast-growing tech environment. This is a high-impact opportunity for engineers passionate about automation, distributed systems, and cloud-native infrastructure.
Accountabilities:
- Drive the architecture and evolution of scalable cloud infrastructure and Kubernetes environments designed for high availability and global growth.
- Define and implement platform reliability strategies, including zero-downtime deployments, disaster recovery, rollback mechanisms, and resilience improvements.
- Improve and maintain observability systems, monitoring frameworks, and telemetry infrastructure to support operational excellence and system transparency.
- Build and optimize Infrastructure as Code and self-service platform capabilities to reduce operational overhead and improve developer experience.
- Lead platform-related incident response activities, conduct blameless post-mortems, and implement long-term systemic improvements.
- Collaborate closely with engineering teams to define technical roadmaps, architecture standards, and scalable operational practices.
- Mentor and support teammates through technical guidance, design reviews, and knowledge sharing initiatives.
- Drive continuous improvement in CI/CD pipelines, GitOps workflows, automation strategies, and cloud-native infrastructure operations.
Requirements:
- 5+ years of hands-on experience in Site Reliability Engineering, Platform Engineering, DevOps, Cloud Infrastructure, or similar infrastructure-focused engineering roles.
- Proven expertise operating and scaling high-throughput, highly available production systems.
- Deep practical experience with Kubernetes in cloud environments such as Azure, AWS, or GCP.
- Strong understanding of observability concepts, including monitoring, SLIs, SLOs, error budgets, logging, and distributed tracing.
- Proficiency in Go or Python, with strong software engineering and automation skills.
- Experience with Infrastructure as Code tools such as Pulumi, Terraform, or OpenTofu, along with GitOps workflows and CI/CD automation.
- Strong knowledge of cloud-native technologies, distributed systems, and reliability engineering best practices.
- Demonstrated experience leading infrastructure initiatives, writing technical proposals, and driving architecture decisions.
- Strong communication skills with the ability to collaborate effectively across technical teams and stakeholders.
- Willingness to participate in on-call rotations and manage critical production incidents.
- Additional experience with service meshes, API gateways, Kubernetes operators, or highly available PostgreSQL environments is considered a plus.
Benefits:
- Remote-first work environment with flexibility to work from home across eligible locations.
- Opportunities for in-person collaboration through team events, workshops, and office gatherings.
- Flexible work arrangements supporting strong work-life balance.
- Wellness and lifestyle benefits, including fitness memberships and bike leasing programs.
- Inclusive, collaborative, and growth-focused company culture.
- Opportunity to contribute directly to the scaling of a fast-growing international technology platform.
- Access to regular team events, culture initiatives, and company gatherings.
- Option to work remotely from locations within the European Union, depending on team arrangements.
- Strong emphasis on personal development, ownership, and long-term career growth.
How Jobgether works:
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.
We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.