Visual exploration of a mammography dataset

OMAMA-DB is a large public mammography dataset designed to support research in breast cancer screening, medical image analysis, and machine learning. It combines 2D full-field digital mammography (FFDM) images and 3D digital breast tomosynthesis (DBT) volumes, together with pathology-based cancer labels and automated lesion annotations. The dataset was created to address a major gap in the field: while mammography is one of the most important tools for early breast cancer detection, publicly accessible datasets are often too small, incomplete, or difficult to obtain for broad computational research. OMAMA-DB was introduced to provide a large-scale, curated resource that can be used both for algorithm development and for visual analytics research. The attached paper describing the dataset will soon appear as:

Kanamarlapudi, A., Zurrin, R., Gaibor, E., Bendiksen Gutierrez, B., Goyal, N., Narayanappa, V. S., Simovici, D., Haspel, N., Pomplun, M., Lee, H., Bandler, M., Sorensen, G., & Haehn, D. OMAMA-DB: The Oregon–Massachusetts Mammography Database. Journal of Medical Imaging (JMI), 2026.

Starting from an institutional collection of 967,991 images, the creators of OMAMA-DB applied a multi-stage curation pipeline to remove images with missing labels, uncommon dimensions, rare scanner types, duplicate studies, corrupted files, and low-quality outliers. The final curated dataset contains 231,080 images in total, including 163,568 2D mammograms and 67,512 3D tomosynthesis volumes. It includes 7,351 cancer cases in the 2D subset and 374 cancer cases in the 3D subset, making it one of the largest publicly available datasets of its kind.

The 2D subset is available in multiple resolutions for rapid prototyping and training, while the 3D subset is provided separately due to its large size. Each image is stored in compressed NumPy format and paired with a JSON metadata file. The dataset is hosted on Harvard Dataverse, with separate persistent identifiers for the 2D and 3D subsets.
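As a minimal sketch of this storage scheme, the snippet below builds one compressed-NumPy image plus a JSON-style metadata record in memory and reads it back. The array shape, archive key (`pixels`), and metadata field names are illustrative assumptions, not the dataset's actual schema; consult the Dataverse documentation for the real layout.

```python
import io
import json

import numpy as np

# Hypothetical image: the shape and dtype are assumptions for illustration,
# not the dataset's actual dimensions.
image = np.zeros((2048, 1664), dtype=np.uint16)

# Each image in OMAMA-DB is stored as a compressed NumPy file; we simulate
# that here with an in-memory buffer instead of a file on disk.
buf = io.BytesIO()
np.savez_compressed(buf, pixels=image)  # "pixels" is an assumed key name
buf.seek(0)

# Each image is paired with a JSON metadata file; field names here are
# placeholders based on the description in the text.
meta_json = json.dumps({"View": "CC", "Laterality": "L",
                        "PixelSpacing": [0.07, 0.07]})

# Reading a pair back mirrors what loading from disk would look like.
loaded = np.load(buf)["pixels"]
meta = json.loads(meta_json)
print(loaded.shape, meta["View"], meta["Laterality"])
```

Reading image and metadata through separate files keeps the pixel data in a compact binary form while leaving the per-image attributes human-readable.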

Metadata

Each image in OMAMA-DB is accompanied by structured JSON metadata. This metadata includes patient and acquisition information such as mammographic view, laterality, pixel spacing, and display window settings, as well as lesion information in the form of bounding box coordinates, confidence scores, and pathology-derived labels. The dataset includes the four standard mammography screening views: left and right craniocaudal (CC) and left and right mediolateral oblique (MLO). It is relatively balanced across view types and laterality, and it includes both 2D and 3D modalities, which makes it especially suitable for comparative analysis and visualization tasks.

A key component of the dataset is the automated cancer lesion annotation generated with DeepSight, a research version of the FDA-cleared Saige-Dx system. For each image, DeepSight detects suspicious regions and produces bounding boxes with confidence scores indicating the likelihood of malignancy. These annotations are stored as JSON metadata alongside the images.
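Since each detection carries a confidence score, a common preprocessing step is thresholding the boxes before display or analysis. The structure below is a guess at what such an annotation might look like; the key names (`boxes`, `coords`, `score`) and the coordinate convention are illustrative, not the actual DeepSight output format.

```python
# Hypothetical annotation record -- key names and the [x0, y0, x1, y1]
# coordinate convention are assumptions for illustration.
annotation = {
    "boxes": [
        {"coords": [120, 340, 480, 700], "score": 0.92},
        {"coords": [900, 150, 1010, 260], "score": 0.18},
    ]
}

def confident_boxes(ann, threshold=0.5):
    """Keep only detections at or above a confidence threshold."""
    return [b for b in ann["boxes"] if b["score"] >= threshold]

kept = confident_boxes(annotation)
print(len(kept), kept[0]["score"])
```

Varying the threshold interactively (rather than fixing it at 0.5) is itself an interesting visualization parameter, since it trades off sensitivity against visual clutter.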

Below is an outline of the metadata content:

2D Dataset Statistics

3D Dataset Details

Example data with cancer annotations, including image artifacts.

Tasks

While OMAMA-DB provides a rich and large-scale collection of mammography images together with metadata, pathology labels, and automated lesion annotations, current analysis workflows are typically limited to model training, quantitative evaluation, or inspection of individual cases. There is a strong need for visual analytics approaches that support understanding this dataset at scale while still enabling detailed inspection.

We thus challenge you to address one or more of the following tasks:

Dataset

The full dataset can be found here: https://dataverse.harvard.edu/dataverse/omama.
Follow these instructions to access the data.

For a first look at sample data, you can check out this notebook.


Questions?

Please feel free to send any questions to: biovis_challenge@ieeevis.org.

Chairs of the Bio+MedVis Challenge @ IEEE VIS 2026: