Job Description

Disclaimer – Must read: Commitment & Focus
This role requires full-time dedication, with clear priority given to Darwoft projects during the established working hours. It is not compatible with other full-time professional engagements. Any additional professional activities must be disclosed in advance and must not interfere with the responsibilities or working hours of this role.

About Darwoft

Darwoft is a software factory that develops custom software solutions and provides IT staff augmentation services for international clients, primarily in the United States and Latin America. We work with startups and high-growth companies to build high-impact digital products. Our culture is people-first, focused on technical quality, long-term relationships, and a collaborative mindset. We combine technical excellence with a close, personal way of working.

Senior Site Reliability Engineer (AI Platform & Observability) | Contractor Global

General Information

Location: Remote (Global)
Contract Type: Contractor
Industry / Project: AI Infrastructure & Platform Operations
Time Zone: Coordination with US / LATAM teams
English Level: Advanced (C1)

About the Role

We are seeking a Senior Site Reliability Engineer with a deep focus on observability and AI platform operations. This role sits at the intersection of reliability engineering and emerging AI infrastructure. You will own the instrumentation, visibility, and operational health of AI-powered systems, including LLM API gateways, token usage pipelines, and model serving infrastructure.

You will act as the authority on runtime behavior across our AI stack, building the tooling and insights required to understand, measure, and optimize system performance, reliability, and cost.

Responsibilities

Design and operate AI gateway infrastructure, including routing, rate limiting, and traffic shaping for LLM API traffic.
Build and maintain deep observability into AI workloads: token consumption, model latency, cost attribution, and error rates by model, team, and use case.
Define and track SLIs, SLOs, and error budgets for AI services and API-dependent workflows.
Instrument LLM-backed applications to surface prompt/completion telemetry, retry patterns, and quota burn rates.
Develop dashboards and alerting using Grafana, Loki, and Prometheus tailored to AI traffic patterns (beyond traditional infrastructure metrics).
Maintain and evolve observability pipelines capable of handling high-cardinality AI metadata.
Lead incident response for AI platform degradations, including model unavailability, gateway saturation, and upstream provider outages.
Automate operational workflows across AI infrastructure using Infrastructure as Code (IaC) and CI/CD practices.
Collaborate closely with ML/AI engineering teams to embed reliability and cost-visibility practices early in the development lifecycle.

Requirements

Must-Have:

Strong experience with AWS cloud services, specifically those relevant to AI workloads (Bedrock, SageMaker, Lambda, API Gateway).
Hands-on expertise with Kubernetes in production environments.
Proven experience building and operating observability stacks (Prometheus, Grafana, Loki), with an emphasis on application- and API-layer metrics.
Solid understanding of API gateway patterns, including routing, throttling, authentication, and traffic observability.
Experience instrumenting and monitoring LLM or AI API usage (token budgets, cost tracking, latency profiling).
Proficiency in Python, Go, or Bash for automation and tooling.
Mastery of Infrastructure as Code (Terraform) and CI/CD pipelines.
Strong analytical mindset with the ability to extract signal from high-cardinality telemetry.

Nice-to-Have:

Experience operating or integrating AI gateway solutions (e.g., Kong AI Gateway, Portkey, LiteLLM).
Familiarity with OpenTelemetry and distributed tracing for AI/ML workloads.
Experience with FinOps practices for AI, including chargeback models and cost anomaly detection.
Knowledge of service mesh technologies and their role in AI traffic management.

What We Offer (Contractor)

Contractor agreement with payment in USD.
100% remote work in an international environment.
Time off on Argentine public holidays.
Professional development in the cutting-edge field of AI Platform Engineering.
Referral program and access to learning platforms.
English classes to further enhance professional communication.

Explore this and other opportunities at: www.darwoft.com/careers

About Darwoft

Darwoft is the leading nearshore software team accelerating innovation for healthcare companies worldwide. From disruptive startups to global enterprises, we deliver secure, powerful, and user-focused solutions that scale.

Our expertise includes MVP development, application modernization, UX/UI design, staff augmentation, and advanced Data + AI services that empower real-time decision-making and intelligent automation. We work across a variety of industries including healthcare, fintech, telecommunications, retail strategy, and direct sales.

With agile methodologies and cross-functional teams, we deliver value from day one, helping our clients move fast, stay competitive, and build products their users love.

At Darwoft, we foster a collaborative and growth-oriented work culture where technology meets creativity. We believe in strong partnerships, continuous learning, and building meaningful solutions that make an impact.

Let’s connect. Let’s build what’s next, together.

Industry: IT & Software
Company Size: 51-200 employees
Headquarters: Hillsboro, Oregon
Year Founded: 2010
Website: darwoft.com

982 - SRE Engineer (AI Platform & Observability) · Senior · Remote · LATAM
