SAIGroup

Data Manager — Multimodal Medical Foundation Models

SAIGroup  •  Bengaluru, IN (Onsite)  •  4 months ago

Job Description

About the Role

You will lead data operations for a cutting-edge research group developing 3D medical multimodal foundation models and agentic clinical AI systems. These models rely on extremely high-quality, well-structured, and compliant datasets, including 3D medical imaging volumes (MRI, CT, PET), clinical text corpora, annotations, and multimodal metadata.

Your job is to own the end-to-end data lifecycle: acquisition, ingestion, cleaning, versioning, labeling, quality control, governance, and delivery to researchers. You are the central node ensuring our foundation model teams and medical agent teams have clean, scalable, well-documented data pipelines.

This is a pivotal foundational role—without great data, large models cannot be great.

What You Will Work On

Multimodal Medical Data Ops

  • Oversee ingestion and processing of 3D medical volumes (DICOM, NIfTI, MHA) and associated clinical texts.
  • Build automated pipelines for metadata extraction, de-identification, slice/series validation, and cohort structuring.
  • Manage large-scale internal datasets and external research datasets (BraTS, LiTS, MIMIC-CXR, CheXpert, MosMed, etc.).

Data Infrastructure & Versioning

  • Implement scalable data storage, cataloging, and retrieval systems for multimodal training data.
  • Own dataset version control, lineage tracking, reproducibility, and dataset documentation.
  • Collaborate with ML systems engineers on high-throughput data loaders, sharding strategies, and caching mechanisms.

Annotation & Labeling Programs

  • Lead medical annotation workflows with radiologists, medical students, and labeling vendors.
  • Create guidelines for ROI labeling, segmentation, captioning, report alignment, and case-level curation.
  • Build semi-automated labeling pipelines using model-assisted tools.

Data Quality, Compliance & Governance

  • Enforce strict standards on data quality, completeness, consistency, and bias control.
  • Ensure adherence to medical data privacy, HIPAA-equivalent frameworks, and institutional data-sharing rules.
  • Manage PHI de-identification, audit logs, access control, and compliance approvals.

Collaboration with Research & Engineering

  • Work closely with foundation-model researchers to understand data needs for model training.
  • Partner with agentic system designers to supply structured datasets for clinical reasoning tasks.
  • Collaborate with foundational engineers on data access layers, performance bottlenecks, and dataset optimization.

Why This Role Is Critical

  • The foundation model relies on high-quality 3D and textual data at scale.
  • You shape the data pipelines enabling next-generation medical AI agents.
  • You ensure clinical-grade governance, safety, reproducibility, and trust.
  • Your systems become the backbone for research, experiments, and deployments.

For candidates motivated by the intersection of data, healthcare, and machine learning, this is a high-impact opportunity.

What We’re Looking For

  • Strong experience managing large multimodal or imaging datasets, ideally medical imaging.
  • Proficiency with DICOM/DICOMweb, NIfTI, PACS, and medical imaging toolkits (dicompyler, pydicom, MONAI, ITK).
  • Experience with ETL pipelines, distributed data systems, and cloud/on-prem storage.
  • Knowledge of metadata standards, ontologies, and text–image linking strategies.
  • Comfort working with Python, SQL, and data tooling (Airflow, Prefect, Dagster, dbt, Delta Lake, etc.).
  • Understanding of data privacy, de-identification, and compliance requirements in healthcare.
  • Strong communication skills and the ability to coordinate between engineers, researchers, clinicians, and data partners.

Nice to Have

  • Experience with vector databases, multimodal retrieval, or embedding store design.
  • Familiarity with annotation tools (Labelbox, CVAT, iMerit, custom MONAI Label pipelines).
  • Prior work with clinical NLP datasets or multilingual Indian medical corpora.
  • Experience conducting bias audits, dataset characterization, or quality scoring at scale.
  • Contributions to open datasets, benchmarks, or data documentation frameworks.

What We Offer

  • Competitive compensation.
  • Access to one of the most ambitious medical multimodal datasets in the region.
  • Collaboration with scientists building India’s first 3D multimodal medical foundation model.
  • Autonomy to design data systems from the ground up.
  • A mission-driven team working to transform clinical care with agentic AI.

About SAIGroup

SAIGroup is a private investment company focused on acquiring and growing enterprise AI companies. The SAIGroup portfolio includes SymphonyAI, ConcertAI, and RhythmX AI.

Industry: Finance & Insurance
Company Size: 11-50 employees
Headquarters: Palo Alto, California
Year Founded: 2017