Job Description

Application Operations Engineer

Department: Engineering

Employment Type: Full Time

Location: Plovdiv

Reporting To: Regional Head of Application Operations

Compensation: €48,500 / year

Reward Gateway, part of Edenred, is a global leader in benefits and employee engagement. We help businesses attract, engage, and retain top talent through strategic reward, recognition, and well-being solutions.

Guided by our shared missions - ‘Making the World a Better Place to Work’ and ‘Enriching Connections, For Good’ - we’re committed to transforming workplaces and improving people’s daily lives.

Our team embodies entrepreneurial spirit, innovation, and respect. We push boundaries, speak up, and stay human, fostering a culture where imagination thrives.

Your Role in our Mission:
This hands-on role sits at the intersection of operational excellence and engineering craft. You’ll bridge the gap between traditional application support and software engineering by carrying out scripted remediation, configuration management, feature flag operations, runbook automation, and safe, bounded code-level fixes, all under clearly defined guardrails.

The goal is to reduce unnecessary L3 escalations while increasing autonomy, quality, and impact for our Application Operations function. You’ll apply these practices across our AWS environment (EKS), PHP services, and MySQL databases, using Datadog as our observability platform, Kibana for log exploration, and Heap to help quantify and understand customer impact.

What’s In It For Me?

A chance to be part of a well-established, stable, high-growth ‘Unicorn’ SaaS company with a comprehensive employee benefits package, including:

Annual Wellness Bonus
Monthly Edenred Electronic Food Voucher
Udemy: Access for your professional development
Flexible Holiday plan & other leave benefits
Book Benefit: Professional development books and an additional annual budget for fiction books of your choice
Subsidised sports card and many other benefits!

Flexible Hybrid Working: This is a hybrid role that requires presence in the office at least twice per week, as agreed.

What You’ll be Doing:

L2.5 Operations Delivery

Provide high-quality, timely L2.5 support for PHP applications running on EKS with MySQL backends, operating within clear guardrails that include configuration changes, feature flag operations, scripted runbooks, and safe, bounded code-level fixes.
Model a shift-left mindset: resolve more at L2.5, automate more, and escalate less, increasing the percentage of incidents resolved without L3 involvement and improving MTTR.
Participate in a healthy, sustainable on-call rotation with fair schedules, clear escalation paths, and strong post-incident learning practices.

Engineering Practices Within Operations

Apply engineering discipline to operational work: use version control, code review, and testing standards for scripts, runbooks, and automation tooling you produce.
Develop and maintain automation scripts, runbooks, and playbooks for known issue patterns across workloads, services, and operational scenarios.
Identify and automate repetitive remediation tasks to reduce manual toil and improve MTTR.

Observability and Service Readiness

Collaborate with peers to ensure the right monitoring signals, dashboards, and alerts exist in Datadog. Tune app-level alerts and dashboards to minimize noise and surface actionable signals.
Use Kibana to interrogate logs and correlate events with Datadog signals during investigations; improve log usefulness by feeding back patterns for better parsing and context.
Use Heap to triangulate and quantify customer impact (affected flows, cohorts, and volumes) during incidents and problem investigations; incorporate findings into incident timelines and post-incident reviews.
Participate in service onboarding and operability reviews to ensure new and changed services meet defined supportability standards before production.
Contribute to the Service Catalogue with accurate ownership, SLAs/SLOs, runbooks, and escalation paths for supported services.

Technical Operations and Incident Participation

Act as a first responder for application incidents at L2.5: triage, diagnose, and remediate within guardrails (e.g., safe config changes, feature flag toggles, rolling restarts, cache purges, scripted data fixes). Support major incidents by providing technical context, structured diagnostics, Datadog/Kibana evidence, Heap impact analysis, and coordinated remediation alongside the incident commander.
Use structured diagnostics before escalating — attach clear evidence, reproducibility steps, and impact assessments to every L3/SRE handoff.
Feed operational findings into Problem Management and contribute to post-incident reviews; capture learning in improved runbooks, alerts, and automation.

Quality, Process, and Continuous Improvement

Help define, measure, and report on operational KPIs such as MTTR, percentage resolved at L2/L2.5, escalation rate, first-contact resolution, and SLO adherence.
Continuously assess processes and workflows, delivering improvements that increase efficiency, consistency, and quality; balance reactive demand with proactive improvement work in Agile-aligned ways of working.
Maintain high standards of documentation — runbooks, known errors, and operational guides are accurate, accessible, and kept up to date.

Stakeholder Collaboration

Work closely with the Director of Application Operations, Problem Manager, and PETO peers (Platform, Infrastructure, Data, SRE) to ensure a coherent, joined-up operational approach.
Partner with product-aligned engineering teams to understand application architecture, service dependencies, and failure modes; encode this knowledge into operational capabilities and runbooks.

Scope and Interfaces (complementary to SRE)

In scope: application-centric remediation under guardrails; automation of known issue patterns; high-quality runbooks; structured diagnostics; service readiness/documentation for PHP services on EKS with MySQL; ownership of app-level dashboards/alerts in Datadog, investigative use of Kibana logs, and customer-impact analysis via Heap.

Experience and Skills You Need in this Role:

Essential

Proven experience in application support or operations engineering in cloud environments, ideally supporting PHP services running on Kubernetes (EKS) with MySQL backends.
Hands-on capability in at least one backend language (PHP preferred; Python or similar also valuable) sufficient to read, diagnose, and write safe operational scripts and minor fixes under guardrails.
Practical Kubernetes skills for operations: kubectl/Helm basics, investigating pods/deployments, reading logs/events, understanding readiness/liveness probes, and performing safe rollouts/rollbacks within documented guardrails.
MySQL operational fluency: connection and pool issues, slow query detection, query plan basics, common remediation patterns (e.g., indexing recommendations to hand to L3, safe data fixes under runbook guardrails), and understanding of replication/backup implications.
Strong experience using Datadog (APM/metrics/traces/dashboards/alerts) for investigation and detection; confident using Kibana for log exploration and correlation; ability to leverage Heap to assess user impact and prioritize remediation.
Familiarity with ITSM tooling (e.g., Jira Service Management) and ITIL-aligned incident and problem management processes.
Strong communication skills; clear, concise documentation; collaborative approach focused on reducing toil, increasing automation, and raising the quality bar.

Desirable

Experience with feature flag platforms and configuration-as-code within safe operational guardrails.
Familiarity with AWS services that commonly interface with PHP/EKS workloads (e.g., CloudWatch, ALB, S3, SQS) and how they surface in Datadog and Kibana.
Exposure to service onboarding/operability reviews, SLOs, and contributing to a Service Catalogue.
Experience balancing incident response with proactive improvement work in Agile contexts; strong documentation discipline.

The Interview Process:

A technical interview with Application Operations leadership and a PETO peer, plus a practical scenario or technical assessment relevant to the L2.5 operating model.

Please note that the role includes an on-call rotation with fair schedules, clear escalation paths, and an emphasis on post-incident learning and sustainable practices.

At Reward Gateway | Edenred we are committed to ensuring an inclusive and accessible recruitment process for all candidates. If you have any specific requirements or need reasonable adjustments at any stage of the recruitment journey, please let your Talent Acquisition Partner know. Your needs are important to us, and we want to ensure an equitable experience for every candidate.

Be comfortable. Be you.
At Reward Gateway, we want all our employees to feel comfortable bringing their passion, creativity and individuality to work. We value all cultures, backgrounds, and experiences, as we truly believe that diversity drives innovation. Express yourself, join our community and help us Make the World a Better Place to Work.

About Reward Gateway

Since 2006, we’ve helped the most innovative companies and HR leaders transform the employee experience to attract and retain top talent through employee benefits, strategic reward and recognition, wellbeing and much more. Across the globe, over 750 of us work together to make the world a better place to work, and as an ambitious, fast-growth, HR Tech SaaS company we’re flexible, inclusive and keen to meet talented individuals who are passionate about positively impacting the future of work. Clients include American Express, Unilever, Samsung, IBM and McDonald's. For further information, please visit: www.rewardgateway.com

Industry: IT & Software

Company Size: 501-1,000 employees

Headquarters: London, GB

Year Founded: 2006

Website: rewardgateway.com
