Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akthar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Pieter Gijsbers, Joan Giner Miguelez, Sujata Goswami, Nitisha Jain, Michalis Karamousadakis, Satyapriya Krishna, Michael Kuchnik, Sylvain Lesage, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Hamidah Oderinwale, Pierre Ruyssen, Tim Santos, Rajat Shinde, Elena Simperl, Arjun Suresh, Geoffry Thomas, Vyacheslav Tykhonov, Joaquin Vanschoren, Susheel Varma, Jos van der Velde, Steffen Vogler, Carole-Jean Wu, Luyao Zhang

Research output: Chapter in book/book part; contribution to conference proceedings; academic; peer-reviewed

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.
Original language: English
Title: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Subtitle: Track on Datasets and Benchmarks
Place of publication: Vancouver, Canada
Publisher: NeurIPS
Number of pages: 26
Status: Published - 12 Dec 2024
