# chug

A library to help w/ efficient training for multi-modal data. Initially focused on image & document + text tasks.
`chug` currently leverages `webdataset` and Huggingface `datasets`.

`webdataset` tar files and dataset pipelines are preferred for scalable pretraining.

Huggingface `datasets` are supported and work great for exploration, validation, and fine-tuning use cases.
## Design

### Submodule Hierarchy
The library has been designed so that functions and classes at different levels can be used independently.

If one wants to build a loader & pipeline with JSON/YAML serializable configs, use the top-level `chug.create_loader()` in `chug/loader.py`. Depending on the dataset source, one can easily switch between webdataset and HF datasets (and, in the future, other sources).
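For example, a minimal sketch of the config-driven path, with `DataCfg` fields taken from the usage examples below; the JSON round-trip assumes the configs are plain dataclasses (as the module notes below describe) and uses only the standard library:

```python
import dataclasses
import json

import chug

# Describe the data source with a serializable config dataclass.
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/idl-train-0{0000..2999}.tar',
    batch_size=8,
    num_samples=3144726,
    format='wds',
)

# Configs being dataclasses, they serialize cleanly to JSON/YAML.
print(json.dumps(dataclasses.asdict(data_cfg), indent=2))
```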
Bypassing the highest level, one can also call the `build_pipeline_*` methods in `task_pipeline` and then call `create_loader_wds` with a full array of args for `wds`-only use cases.

If one doesn't want to use `chug` loaders and pipelines at all, the `image`, `text`, and `wds` (especially decoder) functionality may be useful in other projects.
### Library modules (lowest to highest level)
The dependencies of modules within the library are intended to follow the hierarchy below, e.g. `doc` depends on `wds`, but `wds` should never depend on `doc`. At the base sit the configs and structures (dataclasses) for general use across the library.
#### wds

Webdataset (`wds` for short) specific code. Extensions and alterations of `webdataset` functionality to fit the covered use cases and improve robustness.

All data pipelines in `chug` currently leverage `wds` pipelines, even when not using `wds` datasets.

Document-oriented decoding (a PDF decoder) is present in `chug/wds/decode.py`; it can be used with any webdataset pipeline as a decoder, e.g. `wds.decode(chug.wds.DecodeDoc('pill'), 'pill')`.
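For instance, a minimal sketch of using the decoder in a vanilla `webdataset` pipeline, outside of `chug`'s own loaders (the shard filename is a hypothetical local path):

```python
import webdataset as wds

import chug

# Decode .pdf members to PIL images; 'pill' is webdataset's PIL grayscale image spec.
ds = wds.DataPipeline(
    wds.SimpleShardList('idl-train-00000.tar'),  # hypothetical local shard
    wds.tarfile_to_samples(),
    wds.decode(chug.wds.DecodeDoc('pill'), 'pill'),
)
sample = next(iter(ds))
```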
#### hfds

Huggingface `datasets` support. A minimal wrapper that allows `datasets` to be used with chug processing pipelines.

The processing pipelines remain webdataset based when using `datasets`; they are invoked via a custom collate class.
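As a hedged sketch, switching the loader factory to the Huggingface path might look like the following; the `format='hfds'` value and the hub dataset id are assumptions for illustration, not confirmed API:

```python
import chug

# Same factory, different source: a Huggingface hub dataset instead of tar shards.
data_cfg = chug.DataCfg(
    source='pixparse/IDL-wds',  # assumed hub dataset id, for illustration only
    batch_size=8,
    format='hfds',  # assumed format value selecting the Huggingface datasets path
)
```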
#### image

Image processing, `torchvision` and `albumentations` based transform building code. A mix of generic image (ImageNet, SimCLR) transforms and document-specific transforms, including an implementation of `albumentations` based `nougat` transforms.
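A minimal sketch of building an image preprocessor on its own, using the same config and factory as the usage examples below:

```python
import chug

# Document-oriented train transform; 'doc_better' is one of the document
# transform types used in the examples below.
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
```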
#### text

Text processing and tokenization code.
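A minimal sketch of building a text preprocessor on its own, mirroring the usage examples below:

```python
import chug

# Wraps a Huggingface tokenizer with task-specific prompt/start tokens.
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_idl>',
    task_start_token='<s_idl>',  # NOTE needs to be added to tokenizer
)
```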
#### doc

Document processing code. Currently focused on processors that apply image/pdf decoders and process document OCR or VQA annotations.
#### task_pipeline

Task-specific pipelines, where dataset formats meet modelling needs.

Inputs to task pipelines are sample dictionaries based on the dataset form; they are decoded and then processed into outputs that match model input requirements. Task-specific pipelines are inserted into the more generic data pipeline.
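A minimal sketch of assembling a document-reading task pipeline from image and text preprocessors, using the same configs as the usage examples below:

```python
import chug

# Preprocessors for the image and text sides of the task.
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_idl>',
    task_start_token='<s_idl>',
)

# The task pipeline decodes dataset samples and maps them to model inputs.
task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
)
task_pipe = chug.create_task_pipeline(task_cfg)
```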
#### chug.loader

This lone top-level file includes the main factory methods for creating loaders w/ associated pipelines from config dataclasses.
#### app

Most applications using `chug` will exist outside of the lib, in training libraries, etc. Some built-in utility / exploration apps will be included here.
## Concepts

WIP
## TODOs

### Nearish

* Cleanup and refinement, codebase will change
* Documentation & unit tests
* Support reading of info .json/.yaml files for automatic shard info resolution for webdatasets (like timm)
* Support unified preprocessor functions for combined image + text tokenization (img + text token interleaving, etc.)

### Longish

* Increase range of task pipelines for other tasks and modelling needs
* Explore alternatives to .tar shards (array_record, arrow, etc.)
## Usage / Examples

### Document Reading, Training w/ IDL
```python
import chug

# Image & text preprocessing for the document reading task.
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_idl>',
    task_start_token='<s_idl>',  # NOTE needs to be added to tokenizer
)

# Task pipeline: decode pages, apply preprocessing, surface errors loudly.
task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
    error_handler='dump_and_reraise',
)
task_pipe = chug.create_task_pipeline(task_cfg)

# Stream IDL shards straight from the Huggingface hub via curl.
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/idl-train-0{0000..2999}.tar',
    batch_size=8,
    num_samples=3144726,
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)
```
### Document Reading, Training w/ PDFA

```python
import chug

# Image & text preprocessing, here with nougat-style document augmentation.
img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_nougat')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_pdfa>',
    task_start_token='<s_pdfa>',  # NOTE needs to be added to tokenizer
)

task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
)
task_pipe = chug.create_task_pipeline(task_cfg)

# Stream PDFA shards straight from the Huggingface hub via curl.
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/pdfa-english-train/resolve/main/pdfa-eng-train-{000000..005000}.tar',
    batch_size=8,
    num_samples=1000000,  # FIXME replace with actual
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)
```