hireworks

Senior Staff Software Engineer, Reliability

hireworks  •  Buenos Aires, AR (Remote)  •  4 days ago

Job Description

About hireworks
hireworks is building a community of top talent in key international markets by unlocking unparalleled access to positions at leading U.S.-based companies. As your employer, hireworks will ensure you have a seamless interview, onboarding, and employee experience, providing ongoing support and resources along the way. Established in 2023, hireworks is forging corp-to-corp relationships with leading U.S.-based organizations looking to grow their teams with best-in-class talent around the world. Working with hireworks means unlocking access to a network of local peers and mentors, as well as career opportunities through our client network.

About our client
Our client is building artificial intelligence to make the physical world more responsive. The company is pioneering what it calls the Recognition Economy, a future where repetitive tasks disappear and being recognized unlocks seamless access, comfort, and personalized experiences across everyday environments. From transforming parking into a frictionless drive-in, drive-out experience for millions of users to expanding its intelligence layer across industries such as retail and hospitality, the company is developing technology that makes real-world interactions more intuitive and efficient. As the organization continues to grow, it is looking for builders, innovators, and problem solvers who want to help shape the next generation of intelligent infrastructure for physical spaces.



Our client is seeking a Staff Software Engineer focused on Reliability to own reliability across their entire platform and drive the comprehensive practices that ensure system availability, resilience, and observability for our mission-critical mobility infrastructure. In this role, you will build reliability from first principles, architecting failover systems, implementing chaos engineering, and improving our observability foundation to maintain 99.9%+ uptime as we scale to new markets.

As the technical owner of our reliability posture, you will tackle challenges like external service failover, dependency mirroring, and database replication, working alongside highly technical teams across the organization to influence architecture decisions and establish company-wide reliability standards. You will join the Product Foundations team, playing a key role in building the foundational infrastructure that powers the future of mobility commerce.

 What You'll Do

  • Own the overall reliability posture for the platform, establishing
    practices, metrics, and systems that ensure 99.9%+ uptime across all services
  • Design and implement automatic failover mechanisms for critical external
    dependencies like Twilio for SMS/voice and Stripe for payments with circuit
    breakers, retry policies, and degraded mode operations
  • Architect and build active-passive or active-active regional deployment
    strategies with database replication, automated failover, and DNS-based traffic
    routing including disaster recovery planning and testing
  • Establish comprehensive monitoring using Datadog for APM, logs, and metrics
    correlation
  • Implement synthetic monitoring, SLO-based alerting, on-call rotation, and
    escalation policies while building service health dashboards that show customer
    impact
  • Own the incident management process including workflows, tooling,
    post-mortem culture, runbook automation, and initiatives to reduce mean
    time to recovery (MTTR) from detection to resolution
  • Drive adoption of resilience patterns across all services including health checks,
    graceful degradation, feature flags, rate limiting, backpressure mechanisms, and
    chaos engineering practices
  • Build and maintain local mirrors for critical dependencies with artifact caching,
    dependency pinning, and vulnerability scanning to prevent build failures from
    upstream outages

 About You

  • 8+ years of engineering experience including software engineering, reliability
    engineering, SRE practices, or production operations at scale
  • Demonstrate expert-level reliability engineering skills including hands-on
    experience with multi-region architectures, failover automation, circuit breakers,
    chaos engineering, and disaster recovery
  • Utilize production observability expertise with deep experience implementing
    monitoring, alerting, tracing, and logging systems at scale – specifically Datadog
    or similar APM platforms in high-load environments
  • Apply strong systems thinking with proven ability to design resilient distributed
    systems that gracefully handle failures, network partitions, and external
    dependency outages
  • Demonstrate database and data systems knowledge including replication
    strategies, backup/restore procedures, connection pooling, query optimization,
    and experience with both relational and NoSQL databases
  • Leverage cloud platform expertise with production experience operating and
    ensuring reliability of systems on AWS including multi-region deployments, load
    balancing, and DNS-based failover
  • Possess experience with AI-powered development tools such as Claude Code,
    GitHub Copilot, or similar agentic coding tools for enhanced productivity –
    context engineering in particular
  • Exhibit excellent technical communication with ability to influence technical
    decisions across teams, document complex systems, conduct post-mortems,
    and establish reliability standards organization-wide
  • Demonstrate expert-level Java and/or Scala proficiency with strong
    understanding of JVM performance, concurrency, and operational
    characteristics

 Our Stack

  • Languages + Frameworks: TypeScript, React, Scala (principally), Java (limited)
  • Datastores: MySQL, PostgreSQL, Snowflake
  • Cloud: AWS
  • Version control: Git & GitHub
  • AI Tooling: Copilot on GitHub
  • Observability: Datadog

Benefits
hireworks is cultivating a growing community of top talent across Colombia, Argentina, and Bulgaria. In addition to unlocking access to positions at top-tier U.S.-based companies, we offer a variety of benefits to enhance your experience:
  • Competitive Pay – compensation that reflects your experience and accomplishments.
  • Remote Flexibility – work from anywhere within your local country (Colombia, Argentina, or Bulgaria), with the option to use co-working space as available locally.
  • Paid Time Off – ample vacation days to rest and recharge.
  • Public Holidays – all local federal holidays are fully paid days off.

About hireworks

We do the work. You get the hire.

At hireworks, we're passionate about fueling startup founders and leaders with customized talent acquisition strategies that make a real difference. Our team of seasoned operators, who have been in your shoes, crafted the hireworks solution from the heart, inspired by our own experiences at venture capital firms, venture-backed startups, and publicly traded companies.

We've revolutionized the outdated concept of offshoring from a complex maze into a simple, streamlined, embedded talent solution.

hireworks takes the reins, allowing our clients to effortlessly interview and hire top-notch embedded, remote candidates, without the operational headaches and associated costs.

Industry: HR & Recruiting
Company Size: 11-50 employees
Headquarters: New York, NY