Job Description

Why this role matters

You will design and operate an on-prem AI platform for deploying and scaling models, working across multi-node GPU clusters, distributed systems, and Kubernetes. You will be responsible for building reliable and efficient infrastructure for large-scale model inference, ensuring optimal GPU utilization, performance, and availability of the platform.

The role is based in our Limassol office, Cyprus. If you need to relocate, we offer full support for you and your family to make your move smooth and worry-free.

What you'll actually do

Collaborate closely with infrastructure teams on selecting and configuring GPU servers, high-performance networking, and RDMA-enabled clusters.
Configure and manage GPU MIG partitions based on workload requirements and model characteristics.
Ensure reliable and scalable GPU operations in Kubernetes, including runtime integration, device plugins, and GPU scheduling capabilities.
Design, deploy, and maintain model serving runtimes, including vLLM, ONNX Runtime, SGLang, NVIDIA Triton Inference Server, and KServe, ensuring high performance, scalability, and efficient GPU utilization.
Build and maintain CI/CD pipelines and tooling for model packaging, versioning, and deployment, enabling reliable model delivery for internal teams.
Build and maintain platform tooling for model lifecycle management, including experiment tracking, model versioning, and registry systems (e.g. MLflow).
Enable infrastructure and workflows for model fine-tuning and adaptation (e.g. LoRA), focusing on scalability, reproducibility, and automation within the platform.
Develop and support internal tooling for managing model inputs and configurations (e.g. prompt templates), enabling consistent and reusable model usage patterns.
Conduct performance testing and evaluation of multi-node GPU clusters to identify and resolve bottlenecks.
Build and maintain observability for GPU clusters and model workloads, including metrics such as GPU utilization, memory usage, throughput, and latency.
Integrate tracing for model inference workflows to provide end-to-end visibility into requests and model behavior.
Ensure compliance with security requirements for platform development.
Evaluate and benchmark model inference performance across different runtimes, hardware setups, and configurations to guide platform optimization.

Who we’re looking for

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field
5+ years of experience in infrastructure, platform engineering, or distributed systems, preferably in environments involving machine learning or GPU workloads
Strong experience with Kubernetes, including deploying and operating production workloads
Experience with Linux-based environments
Strong programming skills in Python and/or Go
Experience working with GPU infrastructure, including the NVIDIA or AMD stack and multi-GPU environments, will be considered highly advantageous
Understanding of distributed systems and multi-node workloads
Experience with model serving and inference systems (e.g. vLLM, ONNX Runtime, SGLang, NVIDIA Triton Inference Server, KServe)
Experience with CI/CD pipelines and automation for deploying services or models
Experience with monitoring and observability tools (metrics, tracing, logging)
Nice to have: familiarity with networking concepts relevant to distributed systems (e.g. RDMA, high-performance networking)
Good communication and problem-solving skills
Advanced English, sufficient for a range of work and business purposes
Critical thinking and attention to detail
Decision-making skills and the ability to adapt to change
Ability to write concise and clear documentation
Ability to handle constructive criticism and to build relationships with the team to achieve common goals

What we offer along the way

Competitive salary and annual performance bonus
Full relocation support for you and your family — flights, housing, visas, and legal assistance included
Top-tier health insurance with full family coverage — medical, dental, vision, mental health — plus life insurance for peace of mind
Unlimited learning opportunities: external courses, English lessons, career and leadership development
Education allowance covering school and kindergarten fees
21 working days of annual leave, plus public holidays and fully paid sick, maternity, and paternity leave
Employee appreciation program: branded gifts, birthday days off, celebration budgets for weddings, newborns, and milestones
“Get to know Team” trips — meet colleagues across our global hubs, along with company-wide offsites that raise the bar
Employee share scheme — grow with us
Branded MINI Cooper Countryman company car and private parking
Free in-house sports clubs, Sanctum Club gym access, and jet skis
Access to a corporate doctor
Exclusive discount program with cafes, gyms, and local services
Expat tax perks: up to 50% income tax exemption
Support with the naturalisation process for relocated employees

What your journey looks like

Intro call with Recruiter (30 minutes)
Tech interview (90 minutes)
Behavioural interview (60 minutes)

Please use your Exness work email for internal applications and be sure to disclose any existing conflict of interest you may have.

About Exness

Exness is a global fintech company and one of the world’s leading multi-asset brokers, with 40M daily order executions and 1M+ active clients. We design, build, and run the trading infrastructure that powers the markets, combining deep financial expertise with proprietary tech to deliver speed, stability, and transparency at scale.

We’re more than a trading company. We're a product company, a data company, and a culture-first company—driven by impact and built by people who care. Our platform is engineered in-house by 700+ experts using a modern stack and is trusted by traders worldwide for its reliable execution, deep liquidity, and low and stable spreads.

Founded in 2008, Exness has grown into a truly global organization with 2,000+ professionals across 13 countries. Our largest hub is in Cyprus—home to our engineering, product, and operations teams—while our commercial and customer service teams thrive across Malaysia, Uruguay, and several other global locations.

We believe great performance starts with great care. That’s why we offer more than generous benefits—from relocation support, private healthcare, and school coverage to learning budgets, share schemes, and full ownership of your work. At Exness, people come before process, and long-term thinking wins over short-term pressure.

In 2024, Exness was recognized as one of the Best Places to Work globally and received Great Place to Work® certification in 2022-2023 in Cyprus, reflecting our culture of trust, growth, and real care.

Whether you're an engineer, quant, product lead, customer specialist, or marketing strategist—if you're serious about impact, we're serious about you.

Explore careers at Exness → https://exness-careers.com

Industry

Finance & Insurance

Company Size

5,001-10,000 employees

Headquarters

Limassol, CY

Year Founded

2008

Website

exness.com

DataOps Engineer (AI Platform Engineer)