datatools
Minimal scripts and reusable functions for implementing common data operations (tokenization, splitting, subsampling, packing, and more).
Built with special support for Mosaic Streaming Datasets (MDS).
Clone this repo and install via `pip install -e .`, or install from PyPI via `pip install datatools-py`.
Core installation (without Hugging Face datasets support):
```bash
pip install datatools-py
```
Full installation (with Hugging Face datasets support):
```bash
pip install datatools-py[datasets]  # or: pip install datatools-py[full]
```
The core installation includes all necessary dependencies for working with MDS (Mosaic Streaming Datasets), JSONL, and NumPy files. The Hugging Face `datasets` library is only required if you need to load Hugging Face datasets, Arrow, or Parquet files.
datatools provides core libraries that can be used to easily build custom data pipelines, specifically through `from datatools import load, process`.
`load(path, load_options)`
Loads the dataset at the path and automatically infers its format (e.g., compressed JSON, PyArrow, MDS, etc.) based on clues from the file format and directory structure. It also supports MDS datasets on S3 and compressed MDS files (`.mds.zstd`, `.mds.zst`).
For loading datasets hosted on the Hugging Face Hub, you can use the `hub` input type and specify the path as `path/to/dataset>name#split`. For example, `load("tatsu-lab/alpaca_eval>alpaca_eval#eval")` is equivalent to `datasets.load_dataset("tatsu-lab/alpaca_eval", split="eval")`.
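For example, local files and Hub datasets are loaded through the same call (a minimal sketch; `path/to/dataset` is a placeholder path):

```python
from datatools import load

# The format (JSONL, Parquet, MDS, ...) is inferred from the files on disk.
dataset = load("path/to/dataset")

# Load the "eval" split of a dataset hosted on the Hugging Face Hub.
eval_set = load("tatsu-lab/alpaca_eval>alpaca_eval#eval")
```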
`process(input_dataset, process_fn, output_path, options)`
Processes an input dataset and writes the results to disk. It supports:
- Multi-processing with many CPUs, e.g., `ProcessOptions(num_proc=16)` (or as flag `-w 16`)
- Slurm array parallelization, e.g., `ProcessOptions(slurm_array=True)` (or `--slurm_array`), which automatically sets up `job_id` and `num_jobs` from Slurm environment variables (see the sketch after this list)
- Custom indexing, e.g., processing only a subset with `--index_range 0 30` or using a custom index file with `--index_path path/to/index.npy`; see `ProcessOptions` for details
- By default, output is written as Mosaic Streaming MDS shards, which are merged into a single MDS dataset when the job finishes. The code also supports writing to JSONL files (`--jsonl`) and ndarray files for each column (`--ndarray`); shards for these output formats are not automatically merged.
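As a sketch of the Slurm-array path (the pass-through `copy_fn` is purely illustrative), the same `process` call can be sharded across array tasks:

```python
from datatools import load, process, ProcessOptions

def copy_fn(data_subset):
    # Pass records through unchanged; a real pipeline would transform them.
    for item in data_subset:
        yield item

dataset = load("path/to/dataset")

# With slurm_array=True, job_id and num_jobs are derived from the Slurm
# environment variables, so each array task processes its own slice.
process(dataset, copy_fn, "path/to/output",
        options=ProcessOptions(slurm_array=True))
```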
The `process_fn` should be a function that takes one to three arguments, as sketched below:
- A subset of the data, with `len(...)` and `[...]` access
- The global indices corresponding to the subset (optional)
- The `process_id`, for logging or sharding purposes (optional)
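For instance, a `process_fn` that uses all three arguments might look like the following (a minimal sketch; the output field names are illustrative):

```python
def annotate_fn(data_subset, indices, process_id):
    # `indices` gives the global position of each record in the input dataset;
    # `process_id` identifies the worker, which is useful for logging or sharding.
    for local_idx, item in enumerate(data_subset):
        yield {
            **item,
            "global_index": int(indices[local_idx]),
            "worker": process_id,
        }
```

The full example below puts `load`, a `process_fn`, and `ProcessOptions` together: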
```python
from datatools import load, process, ProcessOptions
from transformers import AutoTokenizer

# Load dataset (can be JSON, Parquet, MDS, etc.)
dataset = load("path/to/dataset")

# Set up tokenizer and processing function
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def tokenize_docs(data_subset):
    for item in data_subset:
        # Tokenize text and return dict with tokens and length
        tokens = tokenizer.encode(item["text"], add_special_tokens=False)

        # Chunk the text into 1024-token chunks
        for i in range(0, len(tokens), 1024):
            yield {
                "input_ids": tokens[i:i + 1024],
                "length": len(tokens[i:i + 1024]),
            }

# Process dataset with 4 workers and write to disk
process(dataset, tokenize_docs, "path/to/output", options=ProcessOptions(num_proc=4))
```
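Since the merged output is itself an MDS dataset, it can be read back with the same loader for inspection (a usage sketch; indexing assumes the output exposes the same `len(...)`/`[...]` access as inputs):

```python
from datatools import load

tokenized = load("path/to/output")
print(len(tokenized), tokenized[0]["length"])
```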
datatools comes with the following default scripts:
- `tokenize`: Tokenize datasets per document
- `pack`: Pack tokenized documents into fixed sequences
- `peek`: Print datasets as JSON to stdout
- `wrangle`: Subsample, merge datasets, make random splits (e.g., train/test/validation), etc.
- `merge_index`: Merge Mosaic streaming datasets in subfolders into a larger dataset

Run `<script> --help` for detailed arguments. Many scripts automatically include all arguments from `ProcessOptions` (e.g., number of processes `-w <processes>`) and `LoadOptions`.