TY - CHAP
T1 - Croissant: A Metadata Format for ML-Ready Datasets
AU - Akhtar, Mubashara
AU - Benjelloun, Omar
AU - Conforti, Costanza
AU - Foschini, Luca
AU - Gijsbers, Pieter
AU - Giner Miguelez, Joan
AU - Goswami, Sujata
AU - Jain, Nitisha
AU - Karamousadakis, Michalis
AU - Krishna, Satyapriya
AU - Kuchnik, Michael
AU - Lesage, Sylvain
AU - Lhoest, Quentin
AU - Marcenac, Pierre
AU - Maskey, Manil
AU - Mattson, Peter
AU - Oala, Luis
AU - Oderinwale, Hamidah
AU - Ruyssen, Pierre
AU - Santos, Tim
AU - Shinde, Rajat
AU - Simperl, Elena
AU - Suresh, Arjun
AU - Thomas, Geoffry
AU - Tykhonov, Vyacheslav
AU - Vanschoren, Joaquin
AU - Varma, Susheel
AU - van der Velde, Jos
AU - Vogler, Steffen
AU - Wu, Carole-Jean
AU - Zhang, Luyao
PY - 2024/12/12
Y1 - 2024/12/12
AB - Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly-used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, complete, yet concise.
KW - machine learning
KW - ml standard
KW - Artificial Intelligence
M3 - Contribution to conference proceedings
BT - 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
PB - NeurIPS
CY - Vancouver, Canada
ER -