Croissant: Metadata for Machine Learning Systems

Vyacheslav Tykhonov, Joan Giner Miguelez

Research output: Other contribution (Academic)

Abstract

Data is vital for machine learning (ML), yet managing it remains a challenge. We present Croissant, a metadata format that standardizes dataset representation across ML tools, frameworks, and platforms. Croissant improves dataset discoverability, portability, and interoperability, and already supports hundreds of thousands of datasets in popular repositories. It allows seamless integration with widely used ML frameworks regardless of where the data is stored. Human evaluations confirm that Croissant's metadata is readable, concise, and complete. The vision is a shared Data Lake enabling federated search across platforms like Dataverse, Kaggle, and HuggingFace. A centralized approach focuses on standardization and repository-level harmonization, while a distributed approach advocates agile, Linked Data-based solutions that empower diverse communities to integrate within a Distributed Data Network using Croissant ML and AI technologies.
Original language: English
Type: Presentation
Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Number of pages: 27
Status: Published - 12 Oct 2024
