Meta

Data Center Production Operations Engineer (Third Shift)

Meta  •  New Albany, OH (Onsite)  •  14 hours ago

Job Description

Meta is seeking a Data Center Production Operations Engineer to support the reliability, efficiency, and scalability of our global data center infrastructure. In this role, you will be responsible for the day-to-day operational health of server fleets and production systems that underpin Meta's family of apps and services. You will work at the intersection of hardware lifecycle management, systems reliability, and operational process improvement, ensuring that production environments meet the demands of billions of users worldwide.

Responsibilities
* Manage and maintain large-scale server fleets across data center environments, including hardware triage, failure analysis, and coordinating repair and replacement workflows
* Monitor production systems health using observability tooling and telemetry data to proactively identify and resolve infrastructure anomalies before they impact service availability
* Develop and refine operational runbooks, escalation procedures, and incident response playbooks specific to data center server environments
* Collaborate with hardware engineering, network operations, and capacity planning teams to support server deployment, decommissioning, and lifecycle transitions
* Analyze failure trends and operational data to identify systemic issues in server hardware or firmware, and drive root cause analysis and corrective action
* Contribute to automation initiatives that reduce manual toil in server provisioning, health checks, and fleet management workflows, including leveraging AI-integrated tooling
* Partner with cross-functional teams to evaluate and implement process improvements that increase operational efficiency and reduce mean time to resolution for production incidents
* Communicate infrastructure status, incident timelines, and risk assessments to engineering and operations stakeholders through clear written and verbal updates
* Support capacity readiness activities by validating server acceptance criteria and coordinating with data center technicians during hardware bring-up and commissioning
* Identify gaps in monitoring coverage or operational tooling and propose solutions that improve fleet visibility and production reliability
* Participate in a 24/7 on-call rotation
* Travel up to 15% of the time
* Work a shifted schedule (includes nights and weekends)

Qualifications
* 6+ years of experience in data center operations, site operations, or production infrastructure engineering supporting large-scale server environments
* 6+ years of hands-on experience troubleshooting and diagnosing failures in server hardware components such as CPUs, memory, storage, and network interface cards
* Experience using systems monitoring and observability platforms to track fleet health, identify anomalies, and drive incident resolution in production data center environments
* Experience developing or improving operational processes, runbooks, or automation scripts to support server fleet management at scale
* Experience collaborating with hardware engineering, network, and capacity teams to coordinate infrastructure deployments and lifecycle activities
* Experience contributing to post-incident reviews and translating findings into durable operational improvements that reduce recurrence across a server fleet
* Background in capacity planning or hardware acceptance testing processes within a large-scale cloud or hyperscale data center organization
* Familiarity with server firmware management, BIOS configuration, and out-of-band management interfaces such as IPMI or Redfish in hyperscale data center environments
* Experience with scripting languages such as Python or Bash to automate data center operations tasks including health checks, inventory management, or alerting workflows

About Meta

Meta's mission is to build the future of human connection and the technology that makes it possible.

Our technologies help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology.

For a full listing of our jobs, visit https://www.metacareers.com

Industry: IT & Software
Company Size: 10,000+ employees
Headquarters: Menlo Park, CA
Year Founded: 2004
Website: meta.com