Job Description

About hireworks
hireworks is building a community of top talent in key international markets by unlocking unparalleled access to positions at leading U.S.-based companies. As your employer, hireworks will ensure you have a seamless interview, onboarding, and employee experience, providing ongoing support and resources along the way. Established in 2023, hireworks is forging corp-to-corp relationships with leading U.S.-based organizations looking to grow their teams with best-in-class talent around the world. Working with hireworks means gaining access to a network of local peers and mentors, as well as career opportunities through our client network.

About our client
Our client is building artificial intelligence to make the physical world more responsive. The company is pioneering what it calls the Recognition Economy, a future where repetitive tasks disappear and being recognized unlocks seamless access, comfort, and personalized experiences across everyday environments. From transforming parking into a frictionless drive-in, drive-out experience for millions of users to expanding its intelligence layer across industries such as retail and hospitality, the company is developing technology that makes real-world interactions more intuitive and efficient. As the organization continues to grow, it is looking for builders, innovators, and problem solvers who want to help shape the next generation of intelligent infrastructure for physical spaces.

Our client is seeking a Staff Software Engineer focused on Reliability to own reliability across the entire platform and drive the practices that ensure system availability, resilience, and observability for our mission-critical mobility infrastructure. In this role, you will build reliability from first principles, architecting failover systems, implementing chaos engineering, and improving our observability foundation to maintain 99.9%+ uptime as we scale to new markets.

As the technical owner of our reliability posture, you will tackle challenges like external service failover, dependency mirroring, and database replication, working alongside highly technical teams across the organization to influence architecture decisions and establish company-wide reliability standards. You will join the Product Foundations team, playing a key role in building the foundational infrastructure that powers the future of mobility commerce.

What You'll Do

Own the overall reliability posture for the platform, establishing practices, metrics, and systems that ensure 99.9%+ uptime across all services
Design and implement automatic failover mechanisms for critical external dependencies, such as Twilio for SMS/voice and Stripe for payments, with circuit breakers, retry policies, and degraded-mode operations
Architect and build active-passive or active-active regional deployment strategies with database replication, automated failover, and DNS-based traffic routing, including disaster recovery planning and testing
Establish comprehensive monitoring using Datadog for APM, logs, and metrics correlation
Implement synthetic monitoring, SLO-based alerting, on-call rotation, and escalation policies while building service health dashboards that show customer impact
Own the incident management process, including workflows, tooling, post-mortem culture, runbook automation, and initiatives to reduce mean time to recovery (MTTR) from detection to resolution
Drive adoption of resilience patterns across all services, including health checks, graceful degradation, feature flags, rate limiting, backpressure mechanisms, and chaos engineering practices
Build and maintain local mirrors for critical dependencies with artifact caching, dependency pinning, and vulnerability scanning to prevent build failures from upstream outages

About You

8+ years of engineering experience, including software engineering, reliability engineering, SRE practices, or production operations at scale
Demonstrate expert-level reliability engineering skills, including hands-on experience with multi-region architectures, failover automation, circuit breakers, chaos engineering, and disaster recovery
Utilize production observability expertise with deep experience implementing monitoring, alerting, tracing, and logging systems at scale, specifically Datadog or similar APM platforms in high-load environments
Apply strong systems thinking with a proven ability to design resilient distributed systems that gracefully handle failures, network partitions, and external dependency outages
Demonstrate database and data systems knowledge, including replication strategies, backup/restore procedures, connection pooling, query optimization, and experience with both relational and NoSQL databases
Leverage cloud platform expertise with production experience operating and ensuring the reliability of systems on AWS, including multi-region deployments, load balancing, and DNS-based failover
Possess experience with AI-powered development tools such as Claude Code, GitHub Copilot, or similar agentic coding tools for enhanced productivity, context engineering in particular
Exhibit excellent technical communication with the ability to influence technical decisions across teams, document complex systems, conduct post-mortems, and establish reliability standards organization-wide
Demonstrate expert-level Java and/or Scala proficiency with a strong understanding of JVM performance, concurrency, and operational characteristics

Our Stack

Languages + Frameworks: TypeScript, React, Scala (principally), Java (limited)
Datastores: MySQL, PostgreSQL, Snowflake
Cloud: AWS
Version control: Git & GitHub
AI Tooling: Copilot on GitHub
Observability: Datadog

Benefits

hireworks is cultivating a growing community of top talent across Colombia, Argentina, and Bulgaria. In addition to unlocking access to positions at top-tier U.S.-based companies, we offer a variety of benefits to enhance your experience:

Competitive Pay – compensation that reflects your experience and accomplishments.
Remote Flexibility – work from anywhere within your local country (Colombia, Argentina, or Bulgaria), with the option to use co-working space as available locally.
Paid Time Off – ample vacation days to rest and recharge.
Public Holidays – all local federal holidays are fully paid days off.

About hireworks

We do the work. You get the hire.

At hireworks, we're passionate about fueling startup founders and leaders with customized talent acquisition strategies that make a real difference. Our team of seasoned operators, who have been in your shoes, crafted the hireworks solution from the heart, inspired by our own experiences at venture capital firms, venture-backed startups, and publicly traded companies.

We've transformed the outdated concept of offshoring from a complex maze into a simple, streamlined, embedded talent solution.

hireworks takes the reins, allowing our clients to effortlessly interview and hire top-notch embedded, remote candidates, without the operational headaches and associated costs.

Industry

HR & Recruiting

Company Size

11-50 employees

Headquarters

New York, NY

Year Founded

2023

Website

hireworks.io

Senior Staff Software Engineer, Reliability
