datatools

Minimal scripts and reusable functions for implementing common data operations (tokenization, splitting, subsampling, packing, and more).

Built with special support for Mosaic Streaming Datasets (MDS).

Table of contents

  • Installation
  • Library
  • Scripts

Installation

Clone this repo and install via pip install -e ., or install from PyPI via pip install datatools-py.

Installation Options

  • Core installation (without Hugging Face datasets support):

    pip install datatools-py
  • Full installation (with Hugging Face datasets support):

    pip install datatools-py[datasets]
    # or
    pip install datatools-py[full]

The core installation includes all necessary dependencies for working with MDS (Mosaic Streaming Datasets), JSONL, and NumPy files. The Hugging Face datasets library is only required if you need to load Hugging Face datasets, Arrow, or Parquet files.

Library

datatools provides core libraries that can be used to easily build custom data pipelines, specifically through from datatools import load, process.

Core functions

load(path, load_options)

Loads the dataset at the given path and automatically infers its format (e.g., compressed JSON, PyArrow, MDS, etc.) based on clues from the file format and directory structure. It also supports MDS datasets over S3 and compressed MDS files (.mds.zstd, .mds.zst).

For loading datasets hosted on the Hugging Face Hub, you can use the hub input type and specify the path as path/to/dataset>name#split. For example, load("tatsu-lab/alpaca_eval>alpaca_eval#eval") is equivalent to datasets.load_dataset("tatsu-lab/alpaca_eval", split="eval").
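For illustration, a minimal sketch of both loading modes; the local path is a placeholder, and the len()/indexing access at the end is an assumption mirroring the subset interface described under process below:

from datatools import load

# Local or S3 path: the format (JSONL, Parquet, MDS, ...) is inferred automatically.
dataset = load("path/to/dataset")

# Hugging Face Hub dataset, using the path/to/dataset>name#split convention;
# this mirrors datasets.load_dataset("tatsu-lab/alpaca_eval", split="eval").
# Depending on your setup, the hub input type may need to be selected
# explicitly via the load options (see LoadOptions).
eval_set = load("tatsu-lab/alpaca_eval>alpaca_eval#eval")

# Assumption: loaded datasets expose the usual len()/indexing interface.
print(len(eval_set), eval_set[0])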


process(input_dataset, process_fn, output_path, options)

Processes an input dataset and writes the results to disk. It supports:

  1. Multi-processing with many CPUs, e.g. ProcessOptions(num_proc=16) (or as the flag -w 16)
  2. Slurm array parallelization, e.g. ProcessOptions(slurm_array=True) (or --slurm_array), which automatically sets up job_id and num_jobs using Slurm environment variables
  3. Custom indexing, e.g. working only on a subset (--index_range 0 30) or using a custom index file (--index_path path/to/index.npy). See ProcessOptions for details.
  4. By default, output is written as mosaic-streaming MDS shards, which are merged into a single MDS dataset when the job finishes. The code also supports writing to JSONL files (--jsonl) and ndarray files for each column (--ndarray). The shards for these output formats are not automatically merged.

The process_fn should be a function that takes one to three arguments (a minimal sketch follows this list):

  1. A subset of the data with len(...) and [...] access
  2. The global indices corresponding to the subset (optional)
  3. The process_id for logging or sharding purposes (optional)
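A minimal sketch of a process_fn that uses all three arguments; the "text" input field, the paths, and the output schema are placeholders for illustration, not part of datatools:

from datatools import load, process, ProcessOptions

def annotate(data_subset, indices, process_id):
    # data_subset supports len(...) and [...] access; indices holds the global
    # position of each item; process_id identifies the worker handling this chunk.
    for idx, item in zip(indices, data_subset):
        yield {
            "text": item["text"],      # assumed input field
            "global_index": int(idx),
            "worker": process_id,
        }

dataset = load("path/to/dataset")
# ProcessOptions(slurm_array=True) would switch to Slurm array parallelization;
# here we use 16 local worker processes.
process(dataset, annotate, "path/to/output", options=ProcessOptions(num_proc=16))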

Example

from datatools import load, process, ProcessOptions
from transformers import AutoTokenizer

# Load dataset (can be JSON, Parquet, MDS, etc.)
dataset = load("path/to/dataset")

# Setup tokenizer and processing function
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

def tokenize_docs(data_subset):
    for item in data_subset:
        # Tokenize text and return dict with tokens and length
        tokens = tokenizer.encode(item["text"], add_special_tokens=False)
        # Chunk the text into 1024 token chunks
        for i in range(0, len(tokens), 1024):
            yield {
                "input_ids": tokens[i:i+1024],
                "length": len(tokens[i:i+1024])
            }

# Process dataset with 4 workers and write to disk
process(dataset, tokenize_docs, "path/to/output", options=ProcessOptions(num_proc=4))

Scripts

datatools comes with the following default scripts:

  • tokenize: Tokenize datasets per document
  • pack: Pack tokenized documents into fixed sequences
  • peek: Print datasets as JSON to stdout
  • wrangle: Subsample, merge datasets, make random splits (e.g., train/test/validation), etc.
  • merge_index: Merge Mosaic streaming datasets in subfolders into a larger dataset

Run <script> --help for detailed arguments. Many scripts automatically include all arguments from ProcessOptions (e.g., the number of processes -w <processes>) and LoadOptions.
