Meta

Production Systems Engineer, AI Systems

Meta  •  Menlo Park, CA (Onsite)  •  6 hours ago

Job Description

Meta is seeking a Hardware Systems Engineer to join our Release to Production (RTP) team. As a key member of the RTP team, you will be responsible for driving the end-to-end hardware lifecycle of Meta's servers, from prototyping and pre-production to production-ready system monitoring, automated provisioning, and remediation of issues. You will work closely with cross-functional teams, including hardware designers, networking teams, system manufacturers, component vendors, capacity engineering, production engineering, production services, and data center operations teams to enable new systems that will be deployed in our production data centers.

Responsibilities
* Drive and execute a comprehensive end-to-end system validation strategy (hardware and software) for various AI/HPC hardware systems in datacenter applications
* Lead the bring-up, validation, and deployment of cutting-edge hardware systems in large-scale deployment with active hands-on participation
* Explore new use cases with customer teams and identify related test methodologies/test cases accordingly
* Investigate and troubleshoot complex failures potentially related to hardware systems with cross-functional teams
* Triage failures and continue root-causing while driving project development work forward
* Identify gaps and opportunities to improve the test process and test methodologies across the New Product Introduction (NPI) space
* Guide automation efforts and data analysis for New Product Introduction projects through engagement with related cross-functional teams
* Communicate project progress and assessments to the related internal and external teams
* Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand the system's architecture
* Develop visibility through data visualization and implement systemic solutions to hardware health issues
* Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues

Qualifications
* Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience
* 8+ years of hands-on software, firmware, or hardware engineering experience building any of the following products: AI silicon, GPUs, TPUs, autonomous cars, AI servers
* Experience in one or more domains such as: ASIC development (silicon design, bring-up, characterization, validation), board-level debug, firmware validation, system validation
* Knowledge of the architecture and components of one of the following products: server, PC, or laptop
* Development or debug experience in one or more of the following areas: hardware fault management, error reporting, error handling on hardware products
* 6+ years of experience in the networking space: switches, Network Interface Cards (NICs), DPUs, etc.
* Knowledge of TCP/IP and experience using tools like iperf/uperf
* Experience working with RDMA/RoCE, including scale-out networks
* Experience working with AI server systems
* Experience working with large scale deployments

About Meta

Meta's mission is to build the future of human connection and the technology that makes it possible.

Our technologies help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology.

For a full listing of our jobs, visit https://www.metacareers.com

Industry: IT & Software
Company Size: 10,000+ employees
Headquarters: Menlo Park, CA
Year Founded: 2004
Website: meta.com