Job Description
Company Background
Our client is a technology startup building advanced voice automation solutions for the quick-service restaurant industry. The company develops a privacy-conscious, high-performance drive-thru voice assistant that automates real-time customer interactions, improves order accuracy, and helps restaurant chains increase revenue and operational efficiency. Its product is designed for fast deployment in noisy, high-volume drive-thru environments and is already gaining market traction through cooperation with a major restaurant chain.
Project Description
The project is a next-generation voice automation engine for drive-thru order-taking in quick-service restaurants. It enables fully automated, real-time conversations between customers and the restaurant ordering system using modern speech recognition, natural language processing, and text-to-speech technologies.
The engineer will work on improving the core voice and audio capabilities of the platform, including Speech-to-Text, Text-to-Speech, noise cancellation, speech enhancement, and real-time audio pipelines. The main focus will be on reducing latency, improving recognition quality and speech clarity, and making the system robust in extremely noisy real-world environments such as drive-thrus.
Technologies
- Speech-to-Text, Text-to-Speech
- Audio Engineering / DSP
- Noise Suppression, Voice Activity Detection, Signal Processing
- PyTorch / TensorFlow
- Real-Time Inference, Streaming Pipelines
- GPU Optimization, Edge Inference
- Production ML Systems
What You'll Do
- Optimize low-latency, real-time Speech-to-Text pipelines for production drive-thru environments;
- Improve Text-to-Speech naturalness, responsiveness, and overall conversational quality;
- Design, tune, and improve noise suppression, echo cancellation, and speech enhancement systems;
- Improve speech recognition accuracy and robustness under challenging acoustic conditions, including engine noise, weather, overlapping speech, poor microphone quality, and outdoor environments;
- Build and scale audio processing infrastructure for production deployments;
- Evaluate, benchmark, and compare speech models using real-world audio data and production scenarios;
- Experiment with modern Speech AI technologies, models, and architectures to improve system performance;
- Collaborate with LLM and conversational AI teams to improve end-to-end voice interaction quality.
Job Requirements
- Advanced Python development skills;
- Deep hands-on expertise with Speech-to-Text and Text-to-Speech systems;
- Proven experience improving speech recognition quality in noisy or otherwise challenging acoustic environments;
- Strong expertise in noise suppression, echo cancellation, voice activity detection, and speech enhancement;
- Strong understanding of real-time and streaming audio architectures, including conversational voice pipelines and real-time inference;
- Experience building low-latency, production-grade AI systems;
- Experience with modern speech AI frameworks, models, and APIs;
- Experience deploying and scaling AI services in cloud environments;
- Ability to troubleshoot complex audio quality, latency, and reliability issues;
- Product-oriented mindset with a focus on real-world performance, customer experience, and high ownership;
- Ability to collaborate effectively with engineering, LLM, and conversational AI teams;
- English level: B2 or higher.
What We Offer
The global benefits package includes:
- Technical and non-technical training for professional and personal growth;
- Internal conferences and meetups to learn from industry experts;
- Support and mentorship from an experienced employee to help you grow and develop professionally;
- Health insurance;
- English courses;
- Sports activities to promote a healthy lifestyle;
- Flexible work options, including remote and hybrid opportunities;
- Referral program for bringing in new talent;
- Work anniversary program and additional vacation days.