Google

System Hardware Reliability Manager, AI Infrastructure

Google  •  Taipei, TW (Onsite)  •  2 days ago

Job Description


Minimum qualifications:

  • Bachelor's degree in Electrical Engineering, Mechanical Engineering, Reliability Engineering, Materials Science, or a related technical discipline, or equivalent practical experience.
  • 10 years of experience in manufacturing.
  • 8 years of experience in people management.

Preferred qualifications:

  • Experience with large-scale data center infrastructure, high-density compute/server topologies, or power/cooling sub-systems.
  • Demonstrated experience in performing risk mitigation during early design phases using predictive modeling or reliability simulations before design lockdown.
  • Experience designing and executing accelerated life testing (ALT, HALT) and manufacturing detection profiles tailored to data center environmental profiles.
  • Deep expertise in structured problem-solving methodologies (e.g., 8D, FMEA, FTA) and physical failure analysis for complex electronic assemblies or server-grade hardware.
  • Strong background in data analysis tools (e.g., JMP, SQL, Python/R) for life-data analysis, Weibull modeling, and predicting fleet-wide failure rates.

About the job

Be part of a team that pushes boundaries, developing custom silicon solutions that power the future of Google's direct-to-consumer products. You'll contribute to the innovation behind products loved by millions worldwide. Your expertise will shape the next generation of hardware experiences, delivering unparalleled performance, efficiency, and integration.

In this role, you will lead the team responsible for building reliability into our products from early architecture through global deployment. You will shift our focus from reactive troubleshooting to scalable strategy, partnering with Design teams and APAC manufacturers to define specifications and mitigate hardware risks before they hit production. Ultimately, you will own the technical strategy for NPI reliability frameworks, drive systemic root-cause failure analysis, and oversee the health of our active global fleet to ensure our infrastructure remains highly resilient.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.

We're the driving team behind Google's groundbreaking innovations, empowering the development of our AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

Responsibilities

  • Coach, mentor, and scale a Reliability Engineering team across planning, validation, and fleet failure analysis, optimizing resource allocation to keep pace with rapidly evolving data center complexities.
  • Oversee manufacturing stability to ensure intrinsic product reliability across all verticals at APAC contract manufacturer locations, proactively identifying workflow opportunities to better support dynamic business needs.
  • Drive Design for Reliability (DfR) methodologies and DFMEAs from the initial concept phase, formalizing a lessons-learned pipeline to directly shape design rules for next-generation ML hardware.
  • Lead high-priority investigations for complex, intermittent field reliability failures, guiding internal teams, OEMs, and external laboratories through advanced failure analysis techniques to validate conclusions and enforce strict remediation standards.
  • Utilize statistical tools, physics-of-failure models, and internal reliability data to predict product life performance, feed back application stress, enable early detection, and define comprehensive end-of-life strategies.

About Google

A problem isn't truly solved until it's solved for all. Googlers build products that help create opportunities for everyone, whether down the street or across the globe. Bring your insight, imagination and a healthy disregard for the impossible. Bring everything that makes you unique. Together, we can build for everyone.

Check out our career opportunities at goo.gle/3DLEokh

Industry: IT & Software
Company Size: 10,000+ employees
Headquarters: Mountain View, CA