Data files Configuration

There are no constraints on how to structure dataset repositories.

However, if you want the Dataset Viewer to show certain data files, or to separate your dataset into train/validation/test splits, you need to structure your dataset accordingly. Often it is as simple as naming your data files according to their split names, e.g. train.csv and test.csv.

What are splits and subsets?

Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of splits (e.g. train and test) that are used during different stages of training and evaluating a model. A subset (also called configuration) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets, where there may be a different subset for each language. If you’re interested in learning more about splits and subsets, check out the Splits and subsets guide!


Automatic splits detection

Splits are automatically detected based on file and directory names. For example, this is a dataset with train, test, and validation splits:

```
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
```

To structure your dataset by naming your data files or directories according to their split names, see the File names and splits documentation and the companion collection of example datasets.

Manual splits and subsets configuration

You can choose the data files to show in the Dataset Viewer for your dataset using YAML. This is useful if you want to manually specify which file goes into which split.

You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).

Here is an example of a configuration defining a subset called “benchmark” with a test split.

```yaml
configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
```

See the documentation on Manual configuration for more information. Also have a look at the example datasets.
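As a sketch of a multi-subset setup that also passes a building parameter (the subset names and file patterns below are hypothetical), you could define one subset per language and set the CSV separator with sep:

```yaml
configs:
- config_name: en
  data_files: "en/*.csv"
  sep: ";"
- config_name: fr
  data_files: "fr/*.csv"
  sep: ";"
```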

Supported file formats

See the File formats doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the example datasets.

Dataset Viewer size-limit errors (TooBigContentError)

If you see Error code: TooBigContentError, then the Dataset Viewer could not read a preview within its limits. Common messages include Parquet error: Scan size limit exceeded and The size of the content of the first rows exceeds the maximum supported size.

What you can do:

  • For Parquet files, use smaller row groups and include a page index (write_page_index=True) so the Viewer can read only what it needs.
  • Avoid very large values in the first rows (very long strings, large JSON blobs, base64 payloads). Move large payloads to separate files when possible.
  • Split very large files into smaller shards or splits, then re-upload.
  • If the issue remains, review Configure the Dataset Viewer and open a discussion on your dataset page with the full error text.
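As a sketch of the sharding idea from the list above (using only the Python standard library; the shard naming scheme is just an example), a large CSV can be split into smaller files that each keep the header row:

```python
import csv
from pathlib import Path

def shard_csv(src: str, out_dir: str, rows_per_shard: int = 100_000) -> list[Path]:
    """Split a large CSV into shards named train-00000.csv, train-00001.csv, ...

    Each shard repeats the header so it is a valid standalone CSV file.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards: list[Path] = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        writer, shard_file, count, idx = None, None, 0, 0
        for row in reader:
            if writer is None or count >= rows_per_shard:
                if shard_file is not None:
                    shard_file.close()
                path = out / f"train-{idx:05d}.csv"
                shard_file = open(path, "w", newline="")
                writer = csv.writer(shard_file)
                writer.writerow(header)  # repeat the header in every shard
                shards.append(path)
                idx += 1
                count = 0
            writer.writerow(row)
            count += 1
        if shard_file is not None:
            shard_file.close()
    return shards
```

After re-uploading the shards in place of the original file, the Viewer only needs to read small files to build its preview.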

Image, Audio and Video datasets

For image/audio/video classification datasets, you can also use directories to name the image/audio/video classes. And if your images/audio/video files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them.
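For example, following the metadata file convention, a metadata.csv with a file_name column can sit next to the images it describes (the file names and captions below are made up):

```
my_image_dataset/
├── metadata.csv
├── 0001.png
└── 0002.png
```

where metadata.csv links each row to an image in the same directory:

```
file_name,caption
0001.png,A photo of a cat
0002.png,A photo of a dog
```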

We provide two guides that you can check out.


