Meta

Global Production Systems Engineer

Meta  •  New Albany, OH / Menlo Park, CA (Onsite)  •  3 hours ago

Job Description

Meta is seeking a forward-thinking, experienced Production Systems Engineer to join the Data Center Operations team. Our data centers, and the tens of thousands of servers installed in them, are the foundation on which our rapidly scaling infrastructure operates efficiently and on which our innovative services are delivered. Meta is at the leading edge of the global data center industry, in terms of both how data centers are designed and how they are operated. This role requires prioritizing competing workstreams based on operational impact and adjusting plans as infrastructure needs evolve.

The candidate we seek is a forward-thinking IT professional with deep experience in using a diverse set of software tools to identify automation solutions for complex operational issues. This role is deeply cross-functional and considers the technical needs of frontline users to identify and automate diagnostic tooling, which enables high-quality, efficient delivery of production servers. The candidate should be able to perform deep data analysis to drive decisions on the top priorities for automating repairs on servers in a hyperscale environment. This role requires driving solutions through code and collaborating effectively with globally distributed teams via clear written and verbal communication. Experience managing servers, programming in scripting languages, and administering Linux systems is required.

Responsibilities
* Identify and root cause systemic issues in the fleet and drive resolutions. Deliver maximum server fleet uptime and utilization rates by leveraging data to understand hardware failure conditions and their root causes
* Write and review code, develop documentation, and debug the hardest problems, live, on some of the largest and most complex systems in the world
* Own and develop diagnostic tooling requirements to run the fleet
* Own and drive the escalation process for Data Center Operations to identify, root cause, and solve complex tooling and hardware issues affecting the fleet
* Execute operational validation and verification activities for new product integration
* Through consistent collaboration with cross-functional tooling teams, help determine root causes and provide input into their development process, with an operations-centric view of how open issues are affecting the fleet
* Build cross-functional relationships and influence policies and procedures to improve global data center operations
* Mentor team members to evaluate and identify better ways to resolve issues and define updates to tools and processes
* Travel up to 25% to support global data center operations

Qualifications
* Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
* 6+ years of experience in production systems engineering, infrastructure engineering, or systems software development for large-scale hardware environments
* 6+ years of experience with hardware lifecycle management, fleet automation, or data center operations systems spanning compute, storage, or networking infrastructure
* Experience developing systems software or tooling in Python, PHP, C, or C++ for Linux-based production environments at scale
* Experience in configuration and maintenance of applications such as web servers, load balancers, relational databases, storage systems and messaging systems
* Experience communicating technical designs and infrastructure decisions through written documentation and cross-functional stakeholder alignment across engineering and operations teams
* Experience designing or operating configuration management and infrastructure-as-code systems for large heterogeneous hardware fleets
* Experience supporting global, multi-site data center infrastructure deployments including hardware qualification and regional rollout coordination
* Familiarity with distributed systems monitoring, alerting, and automated remediation pipelines at hyperscale

About Meta

Meta's mission is to build the future of human connection and the technology that makes it possible.

Our technologies help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology.

For a full listing of our jobs, visit https://www.metacareers.com

Industry: IT & Software
Company Size: 10,000+ employees
Headquarters: Menlo Park, CA
Year Founded: 2004
Website: meta.com