Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Pieter Gijsbers, Joan Giner-Miguelez, Nitisha Jain, Michael Kuchnik, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Pierre Ruyssen, Rajat Shinde, Elena Simperl, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Jos Van Der Velde, Steffen Vogler, Carole Jean Wu

Research output: Contribution to conference types > Paper > peer-review

Abstract

Data is a critical resource for Machine Learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that simplifies how data is used by ML tools and frameworks. Croissant makes datasets more discoverable, portable and interoperable, thereby addressing significant challenges in ML data management and responsible AI. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, ready to be loaded into the most popular ML frameworks.

Original language: English
Pages: 1-6
Number of pages: 6
DOIs
Publication status: Published - 9 Jun 2024
Event: 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024 - Santiago, Chile
Duration: 9 Jun 2024 → …

Conference

Conference: 8th Workshop on Data Management for End-to-End Machine Learning, DEEM 2024
Country/Territory: Chile
City: Santiago
Period: 9/06/2024 → …

Keywords

  • discoverability
  • ML datasets
  • reproducibility
  • responsible AI
