NVIDIA-NeMo/Curator

Scalable data preprocessing and curation toolkit for LLMs

License

Apache-2.0 license


NeMo Curator

🚀 The GPU-Accelerated Open Source Framework for Efficient Generative AI Model Data Curation 🚀

NeMo Curator is a Python library specifically designed for fast and scalable dataset preparation and curation for generative AI use cases such as foundation language model pretraining, text-to-image model training, domain-adaptive pretraining (DAPT), supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT). It greatly accelerates data curation by leveraging GPUs with Dask and RAPIDS, resulting in significant time savings. The library provides a customizable and modular interface, simplifying pipeline expansion and accelerating model convergence through the preparation of high-quality tokens.

Key Features

NeMo Curator provides a collection of scalable data curation modules for text and image curation.

Text Curation

All of our text pipelines have great multilingual support.

Download and Extraction
- Default implementations for Common Crawl, Wikipedia, and ArXiv sources
- Easily customize and extend to other sources
Language Identification
Text Cleaning
Heuristic Filtering
Classifier Filtering
- fastText
- GPU-Accelerated models: Domain (English and multilingual), Quality, Safety, Educational Content, Content Type, and Prompt Task/Complexity Classification
GPU-Accelerated Deduplication
- Exact Deduplication
- Fuzzy Deduplication via MinHash Locality Sensitive Hashing
- Semantic Deduplication
Downstream-task Decontamination
Personally Identifiable Information (PII) Redaction
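
To make the "Fuzzy Deduplication via MinHash Locality Sensitive Hashing" item concrete, here is a minimal CPU-only sketch of the general technique in plain Python. It is not NeMo Curator's GPU implementation: the function names (`shingles`, `minhash_signature`, `candidate_pairs`) and the parameters (5-character shingles, 64 hashes, 16 bands) are illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items: set[str], num_hashes: int = 64) -> list[int]:
    """Keep the minimum hash value per seeded hash function.

    Two sets agree at a given position with probability equal to their
    Jaccard similarity, so similar documents share many positions.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """Split each signature into bands; documents sharing any band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy corpus: "a" and "b" differ only in case and spacing.
docs = {
    "a": "The quick brown fox jumps over the lazy dog",
    "b": "the quick  brown fox jumps over the LAZY dog",
    "c": "zzzzz yyyyy xxxxx wwwww",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
```

In a full pipeline the candidate pairs would normally be verified (for example by computing the true Jaccard similarity) before one document of each pair is dropped. More bands with fewer rows per band catch less similar pairs at the cost of more false candidates.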

Image Curation

Embedding Creation
Classifier Filtering
- Aesthetic and NSFW Classification
GPU Deduplication
- Semantic

These modules offer flexibility and permit reordering, with only a few exceptions. All the modules automatically scale to multiple nodes to increase throughput.

Resources

Get Started

This section explains how to install NeMo Curator and use the Python library, Python modules, and CLI scripts. It also includes a list of tutorials to help you get started right away. Finally, this section explains how to use the NeMo Framework Launcher as an alternative method for interfacing with NeMo Curator.

Install NeMo Curator

Requirements

Before installing NeMo Curator, ensure that the following requirements are met:

Python 3.10 or higher
- packaging >= 22.0
Ubuntu 22.04/20.04
NVIDIA GPU (optional)
- Volta™ or higher (compute capability 7.0+)
- CUDA 12 (or above)
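
As a convenience, the Python-side requirements above can be checked before installing. The following is a minimal sketch, not part of NeMo Curator: the helper names `version_tuple` and `check_requirements` are hypothetical, and the script only checks the Python and `packaging` versions (it does not inspect the OS, GPU, or CUDA).

```python
import sys
from importlib import metadata
from itertools import takewhile

MIN_PYTHON = (3, 10)     # "Python 3.10 or higher"
MIN_PACKAGING = (22, 0)  # "packaging >= 22.0"


def version_tuple(version: str) -> tuple[int, ...]:
    """Parse leading numeric components, e.g. '23.2rc1' -> (23, 2)."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 2:  # pad so that '24' compares as (24, 0)
        parts.append(0)
    return tuple(parts)


def check_requirements() -> dict[str, bool]:
    """Return a pass/fail flag per requirement that can be checked from Python."""
    results = {"python": sys.version_info[:2] >= MIN_PYTHON}
    try:
        installed = metadata.version("packaging")
        results["packaging"] = version_tuple(installed) >= MIN_PACKAGING
    except metadata.PackageNotFoundError:
        results["packaging"] = False
    return results


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'requirement not met'}")
```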

You can get NeMo Curator in 3 ways.

PyPI
Source
NeMo Framework Container

PyPI

pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]

Source

git clone https://github.com/NVIDIA-NeMo/Curator.git
pip install --extra-index-url https://pypi.nvidia.com "./Curator[all]"

NeMo Framework Container

The latest release of NeMo Curator comes preinstalled in the NeMo Framework Container. If you want the latest commit inside the container, you can reinstall NeMo Curator using:

pip uninstall nemo-curator
rm -r /opt/Curator
git clone https://github.com/NVIDIA-NeMo/Curator.git /opt/Curator
pip install --extra-index-url https://pypi.nvidia.com "/opt/Curator[all]"

Extras

NeMo Curator has a set of extras you can use to install only the modules necessary for your workload. These extras are available for all installation methods provided.

pip install nemo-curator # Installs CPU-only text curation modules
pip install nemo-curator[dev] # Installs libraries required for development
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[cuda12x] # Installs CPU + GPU text curation modules
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[image] # Installs CPU + GPU text and image curation modules
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all] # Installs all of the above

Using Nightly Dependencies for RAPIDS

You can also install NeMo Curator using the RAPIDS Nightly Builds:

# Installing from PyPI
pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple "nemo-curator[cuda12x_nightly]"
# Installing from source
pip install --extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple "./Curator[cuda12x_nightly]"

For the image curation modules and all modules, you can use [image_nightly] and [all_nightly], respectively.

Use NeMo Curator

Python API Quick Example

The following snippet demonstrates how to create a small data curation pipeline that downloads and curates a small subset of the Common Crawl dataset.

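# Imports below assume the NeMo Curator 0.x module layout
from nemo_curator import Modify, ScoreFilter, Sequential, TaskDecontamination
from nemo_curator.download import download_common_crawl
from nemo_curator.filters import FastTextQualityFilter, WordCountFilter
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.tasks import Squad, TriviaQA, Winogrande

# Download your dataset
dataset = download_common_crawl("/datasets/common_crawl/", "2021-04", "2021-10", url_limit=10)
# Build your pipeline
curation_pipeline = Sequential([
    # Fix unicode
    Modify(UnicodeReformatter()),
    # Discard short records
    ScoreFilter(WordCountFilter(min_words=80)),
    # Discard low-quality records
    ScoreFilter(FastTextQualityFilter(model_path="model.bin")),
    # Discard records from the evaluation metrics to prevent test set leakage.
    TaskDecontamination([Winogrande(), Squad(), TriviaQA()]),
])
# Execute the pipeline on your dataset
curated_dataset = curation_pipeline(dataset)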
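
For readers who want to see the pattern behind the quick example without installing anything, the sketch below mimics the "modify, then score and filter, in sequence" idea in plain Python. It is only an illustration: `modify`, `score_filter`, and `sequential` are hypothetical helpers, not the NeMo Curator classes. Unicode NFKC normalization stands in for the unicode fix, and the word-count threshold is lowered to 5 so that the toy data passes.

```python
import unicodedata
from typing import Callable

Docs = list[dict]  # assumed record shape: {"id": ..., "text": ...}
Stage = Callable[[Docs], Docs]


def modify(fn: Callable[[str], str]) -> Stage:
    """Build a stage that rewrites the text of every document."""
    def stage(docs: Docs) -> Docs:
        return [{**doc, "text": fn(doc["text"])} for doc in docs]
    return stage


def score_filter(score: Callable[[str], float], keep: Callable[[float], bool]) -> Stage:
    """Build a stage that scores each document and keeps those passing `keep`."""
    def stage(docs: Docs) -> Docs:
        return [doc for doc in docs if keep(score(doc["text"]))]
    return stage


def sequential(stages: list[Stage]) -> Stage:
    """Chain stages so that the output of one feeds the next."""
    def run(docs: Docs) -> Docs:
        for stage in stages:
            docs = stage(docs)
        return docs
    return run


pipeline = sequential([
    # Normalize unicode (stand-in for a unicode-fixing modifier)
    modify(lambda text: unicodedata.normalize("NFKC", text)),
    # Discard short records (threshold of 5 words chosen for the toy data)
    score_filter(lambda text: len(text.split()), lambda n: n >= 5),
])

toy_docs = [
    {"id": 1, "text": "\ufb01ve words are in here now"},  # starts with the 'fi' ligature
    {"id": 2, "text": "too short"},
]
curated = pipeline(toy_docs)
```

Because each stage is just a function from documents to documents, stages can be reordered or swapped freely, which is the property the quick example relies on.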

Explore NeMo Curator Tutorials

To get started with NeMo Curator, you can follow the tutorials available here. These tutorials include:

tinystories which focuses on data curation for training LLMs from scratch.
peft-curation which focuses on data curation for LLM parameter-efficient fine-tuning (PEFT) use-cases.
distributed_data_classification which demonstrates how to use NVIDIA's Hugging Face classifiers to help with data annotation.
single_node_tutorial which demonstrates an end-to-end data curation pipeline for curating Wikipedia data in Thai.
image-curation which explores the scalable image curation modules.

Access Python Modules

The NeMo Curator section of the NeMo Framework User Guide provides in-depth information about how the Python modules work. The examples directory in the GitHub repository provides scripts that showcase these modules.

Use CLI Scripts

NeMo Curator also offers CLI scripts for you to use. The scripts in nemo_curator/scripts map closely to the supplied Python modules. Refer to the NeMo Framework User Guide for more information about the Python modules and scripts.

Use NeMo Framework Launcher

As an alternative method for interfacing with NeMo Curator, you can use the NeMo Framework Launcher. The launcher enables you to easily configure the parameters and cluster. It can also automatically generate the Slurm batch scripts that wrap around the CLI scripts required to run your pipeline.

In addition, other methods are available to run NeMo Curator on Slurm. For example, refer to the example scripts in examples/slurm for information on how to run NeMo Curator on Slurm without the NeMo Framework Launcher.

Module Ablation and Compute Performance

The modules within NeMo Curator were primarily designed to curate high-quality documents from Common Crawl snapshots in a scalable manner. To evaluate the quality of the curated Common Crawl documents, we conducted a series of ablation experiments. In these experiments, we trained a 357M-parameter GPT-style model using datasets generated at various stages of our data curation pipeline, which was implemented in NeMo Curator.

The following figure shows that the use of different data curation modules implemented in NeMo Curator led to improved model zero-shot downstream task performance.

In terms of scalability and compute performance, using the combination of RAPIDS and Dask for fuzzy deduplication enabled us to deduplicate the 1.96-trillion-token subset of the RedPajama V2 dataset in 0.5 hours with 32 NVIDIA H100 GPUs.
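
To put the deduplication figure in per-GPU terms, the arithmetic can be worked out as below. This is a derived back-of-the-envelope calculation, assuming the reported 0.5 hours is the wall-clock time for the whole job across all 32 GPUs; the variable names are illustrative.

```python
TOKENS = 1.96e12   # tokens in the RedPajama V2 subset
WALL_HOURS = 0.5   # reported wall-clock time
NUM_GPUS = 32      # NVIDIA H100 GPUs used

gpu_hours = WALL_HOURS * NUM_GPUS                   # total GPU time spent
tokens_per_gpu_hour = TOKENS / gpu_hours            # throughput per GPU-hour
tokens_per_gpu_second = tokens_per_gpu_hour / 3600  # throughput per GPU-second

print(f"GPU-hours: {gpu_hours}")
print(f"Tokens per GPU-hour: {tokens_per_gpu_hour:.3e}")
print(f"Tokens per GPU-second: {tokens_per_gpu_second:.3e}")
```

Under that assumption the job used 16 GPU-hours, or about 1.2 × 10^11 tokens per GPU-hour.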

[Figures: Processing Time; Comparison to Alternative Libraries]

Additionally, using the CPU-based modules, the following table shows the time required and the resulting data size reduction for each processing step on a Common Crawl snapshot from November/December of 2020 using 30 CPU nodes (with hardware similar to the c5.24xlarge Amazon AWS C5 instance).

Dataset	Download and text extraction (time)	Download and text extraction (output size)	Text cleaning (time)	Text cleaning (output size)	Quality filtering (time)	Quality filtering (output size)
Common Crawl 2020-50	36 hrs	2.8 TB	1 hr	2.8 TB	0.2 hr	0.52 TB
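
The totals implied by the table can be computed directly. The snippet below is a small worked calculation using only the numbers in the table and the stated 30 CPU nodes; the variable names are illustrative.

```python
# (step name, hours, output size in TB), copied from the table above
steps = [
    ("Download and text extraction", 36.0, 2.8),
    ("Text cleaning", 1.0, 2.8),
    ("Quality filtering", 0.2, 0.52),
]
NUM_NODES = 30

total_hours = sum(hours for _, hours, _ in steps)
node_hours = total_hours * NUM_NODES

extracted_tb = steps[0][2]  # size after extraction
final_tb = steps[-1][2]     # size after quality filtering
reduction = 1 - final_tb / extracted_tb

print(f"Total time: {total_hours:.1f} hrs ({node_hours:.0f} node-hours)")
print(f"Size reduction from quality filtering: {reduction:.1%}")
```

Download and extraction account for nearly all of the roughly 37 hours, while quality filtering takes the least time and removes about four fifths of the extracted text.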

Contribute to NeMo Curator

We welcome community contributions! Please refer to CONTRIBUTING.md for the process.

Releases: 17

Latest release: NVIDIA NeMo Curator 0.8.0 (May 9, 2025)
