Job Description

Reality Labs at Meta is seeking a Research Scientist with expertise in multi-modal understanding to advance AI-powered interactions. We're building next-generation capabilities that integrate vision, language, audio, and sensor modalities. This is a unique opportunity to conduct cutting-edge multi-modal research with direct product impact.

Responsibilities
* Lead the design, development, and optimization of multi-modal models that integrate vision, language, audio, and sensor inputs
* Set technical direction for multi-modal research projects
* Conduct research and experiments to improve cross-modal alignment and fusion strategies
* Collaborate with cross-functional teams (engineering, HCI, product) to transition multi-modal research into production
* Explore and adopt novel model optimization, quantization, and efficiency techniques
* Stay current with state-of-the-art advances in multi-modal learning, vision-language models, and related fields

Qualifications
* Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
* Currently has, or is in the process of obtaining, a PhD in Computer Science, Machine Learning, Computer Vision, or a related technical field. Degree must be completed prior to joining Meta
* Demonstrated expertise in multi-modal learning, including architecture design, training, and cross-modal alignment techniques
* Programming experience in Python and hands-on experience with deep learning frameworks such as PyTorch
* Experience developing machine learning models at scale from inception to impact
* 5+ years of research experience working autonomously on ML problems involving multiple modalities (vision, language, audio, or sensor data)
* Deep expertise in vision-language models, cross-modal attention mechanisms, or contrastive learning approaches
* First-authored publications at peer-reviewed AI conferences (e.g., CVPR, NeurIPS, ICML, ICLR, ACL, ECCV)
* Experience with on-device or edge multi-modal model optimization (quantization, sparsity, distillation)
* Demonstrated software engineering experience via internship, work experience, or widely used contributions in open source repositories
* Experience bringing multi-modal AI products from research to production
* Proven track record of developing multi-modal models that fuse vision, language, and/or audio for real-world applications

About Meta

Meta's mission is to build the future of human connection and the technology that makes it possible.

Our technologies help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology.

For a full listing of our jobs, visit https://www.metacareers.com

Industry

IT & Software

Company Size

10,000+ employees

Headquarters

Menlo Park, CA

Year Founded

2004

Website

meta.com
