DDN

Staff Engineer, Lustre

DDN  •  $185k - $230k/yr  •  San Francisco, CA (Remote)  •  4 hours ago

Job Description

This is an opportunity to join a company that has been at the forefront of AI and high-performance data storage innovation for over two decades. DataDirect Networks (DDN) is a global market leader that powers many of the world's most demanding AI data centers, in industries ranging from life sciences and healthcare to financial services, autonomous cars, government, academia, research and manufacturing.

"DDN's A3I solutions are transforming the landscape of AI infrastructure." – IDC

“The real differentiator is DDN. I never hesitate to recommend DDN. DDN is the de facto name for AI Storage in high-performance environments.” – Marc Hamilton, VP, Solutions Architecture & Engineering, NVIDIA

DDN is the global leader in AI and multi-cloud data management at scale. Our cutting-edge data intelligence platform is designed to accelerate AI workloads, enabling organizations to extract maximum value from their data. With a proven track record of performance, reliability, and scalability, DDN empowers businesses to tackle the most challenging AI and data-intensive workloads with confidence.

Our success is driven by our commitment to innovation and customer-centricity, and by a team of passionate professionals who bring their expertise and dedication to every project. This is an exciting and rewarding role for a driven professional looking to make a lasting impact at a company that is shaping the future of AI and data management.

We are seeking a Staff Engineer – LustreFS with 10+ years of experience in distributed storage and Linux-based systems engineering. This is a hands-on senior technical role focused on design, debugging, performance, and operational excellence across LustreFS and adjacent stack components. The ideal candidate brings strong expertise in one or more Lustre subsystems, can independently drive complex investigations, and collaborates effectively across engineering, QE, support and release teams. Engineers who are comfortable using AI to accelerate triage, debugging, code comprehension and new feature design will be especially valuable.

Key Responsibilities

  • Design, develop and debug LustreFS features, fixes and enhancements across relevant subsystems such as llite, MDS/MDT, OSS/OST, LDLM and LNet.
  • Investigate customer and scale-related defects, drive root-cause analysis and implement high-quality fixes with strong attention to correctness and maintainability.
  • Contribute to performance tuning, failure analysis and reliability improvements for large-scale Lustre deployments.
  • Participate actively in code reviews, design reviews and subsystem discussions, bringing rigor to testing and operational readiness.
  • Work closely with QE and support to reproduce issues, improve diagnostic data quality and increase coverage for high-risk failure scenarios.
  • Help document subsystem behavior, debugging approaches, known failure patterns and operational best practices.
  • Use AI-assisted tools where appropriate to speed up issue triage, summarize logs, improve code understanding and capture reusable lessons learned.

Required Qualifications

  • 10+ years of experience in systems software, distributed systems, storage, Linux kernel or filesystem engineering.
  • Strong experience in LustreFS development, support or performance engineering with depth in at least one major subsystem.
  • Strong C programming and Linux systems debugging skills.
  • Working knowledge of Linux kernel internals, filesystem semantics, networking and performance analysis.
  • Experience with LNet and/or high-performance transports such as RDMA, InfiniBand, RoCE or TCP-based storage networking.
  • Ability to debug and resolve issues spanning multiple layers including client, server, network and backend storage.
  • Strong collaboration skills and the ability to work across functions in a fast-moving engineering environment.

Preferred Skills

  • Experience in HPC, AI infrastructure or large-scale parallel storage environments.
  • Exposure to metadata-heavy and throughput-heavy workload characterization and tuning.
  • Familiarity with ZFS, ldiskfs, NVMe-backed storage and related observability / performance tooling.
  • Experience creating test plans, reproducer frameworks, runbooks or diagnostic automation.
  • Comfort using AI tools to accelerate debugging, code reviews, triage, documentation and early-stage design ideation.
  • Experience mentoring junior engineers or leading focused technical efforts within a subsystem.

What You Will Work On

  • Hands-on development and debugging of LustreFS defects, performance issues and subsystem enhancements.
  • Customer-facing and scale-related issue investigation across llite, metadata, object storage, LNet and transport layers.
  • Collaborative design and implementation of reliability, observability and serviceability improvements.
  • Reviewing and validating fixes through targeted tests, failure injection, log analysis and performance characterization.
  • Using AI-assisted workflows to accelerate triage, debug loops, code understanding and documentation quality.
  • Contributing to team redundancy by strengthening documentation, code review quality and subsystem knowledge sharing.

Why This Role Matters

This role is central to building durable engineering redundancy in LustreFS: expanding deep subsystem ownership, reducing concentration risk, and accelerating next-generation delivery through strong engineering fundamentals and AI-enabled execution.

Salary Range for this role: $185,000 - $230,000

Join our dynamic and driven team, where engineering excellence is at the heart of everything we do. We seek individuals who love to challenge themselves and are fueled by curiosity. Here, you'll have the opportunity to work across various areas of the company, thanks to our flat organizational structure that encourages hands-on involvement and direct contributions to our mission. Leadership is earned by those who take initiative and consistently deliver outstanding results, both in their work ethic and deliverables, making strong prioritization skills essential. Additionally, we value strong communication skills in all our engineers and researchers, as they are crucial for the success of our teams and the company as a whole.

Interview Process: After submitting your application, one of our recruiters will review your resume. If your application passes this stage, you will be invited to a 30-minute interview during which a member of our team will ask some basic questions. If you clear the interview, you will enter the main process, which can consist of up to four interviews in total:

  • Coding assessment: Often in a language of your choice.
  • Systems design: Translate high-level requirements into a scalable, fault-tolerant service (depending on role).
  • Real-time problem-solving: Demonstrate practical skills in a live problem-solving session.
  • Meet and greet with the wider team.

Our goal is to finish the main process in 2-3 weeks at most.

DataDirect Networks (DDN) is an Equal Opportunity/Affirmative Action employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, gender expression, transgender status, sex stereotyping, sexual orientation, national origin, disability, protected veteran status, or any other characteristic protected by applicable federal, state, or local law.

About DDN

DDN (DataDirect Networks) is the world’s leading AI and data intelligence company, empowering organizations to maximize the value of their data with end-to-end HPC and AI-focused solutions. Its customers range from the largest global enterprises and AI hyperscalers to cutting-edge research centers, all leveraging DDN’s proven data intelligence platform for scalable, secure, and high-performance AI deployments that drive 10x returns.

Industry: IT & Software
Company Size: 1,001-5,000 employees
Headquarters: Chatsworth, CA
Year Founded: Unknown
Website: ddn.com