huggingface/chug

Minimal sharded dataset loaders, decoders, and utils for multi-modal document, image, and text datasets.

A library to help w/ efficient training for multi-modal data. Initially focused on image & document + text tasks.

chug currently leverages webdataset and Hugging Face datasets.

webdataset tar files and dataset pipelines are preferred for scalable pretraining.

Hugging Face datasets are supported and work great for exploration, validation, and fine-tuning use cases.

chug provides on-the-fly PDF decoding and rendering via either pypdfium2 (https://github.com/pypdfium2-team/pypdfium2) as a default, or fitz/pymupdf (https://github.com/pymupdf/PyMuPDF) if your use case is okay with their AGPL-3.0 license. fitz support must be manually enabled. The PDF handling is implemented at the webdataset level, so you can plug it into other webdataset pipelines. This enables large-scale sharded streaming of native .pdf files without needing to pre-render to .png/.tiff, etc.

Status

This library is still a WIP; consider this an alpha release (pre-announcement). Major features should be working, and the library has been tested with several PDF datasets we will shortly make public. However, do expect breaking changes, lots of improvements, etc.

`pip install --pre chug` will install the current dev version.

TODOs

Nearish

  • Cleanup and refinement; the codebase will change
  • Documentation & unit-tests
  • Support reading of info .json/.yaml files for automatic shard info resolution for webdatasets (like timm)

Mediumish

  • Option to output bbox annotations for lines (or word + word output) for tasks that leverage layout
  • Unified preprocessor functions for combined image + text tokenization (img+text token interleaving, etc.)
  • Image token (patch) packing à la NaViT: online bin-packing algorithms integrated with image preprocessing and the pipeline

Longish

  • Increase the range of task pipelines for other tasks and modelling needs
  • Support additional modalities & targets (video, audio, detection/dense pixel targets, image/video/audio targets)
  • Explore alternatives to .tar shards (array_record, arrow, etc)

Design

Submodule Hierarchy

The library has been designed so that functions and classes at different levels can be used independently.

If one wants to build a loader & pipeline with JSON/YAML-serializable configs, use the top-level chug.create_loader() in chug/loader.py. Depending on dataset sources, one can easily switch between webdataset, HF datasets, and (in the future) other sources.
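
As a rough illustration, a minimal sketch of what "serializable configs" enables, assuming the config objects are plain Python dataclasses (field names taken from the usage examples below):

```python
# A minimal sketch: round-tripping a DataCfg through JSON, assuming the
# config objects are standard dataclasses (field names taken from the
# usage examples below).
import dataclasses
import json

import chug

data_cfg = chug.DataCfg(source='pixparse/idl-wds', split='train', batch_size=None, format='hfids')
cfg_json = json.dumps(dataclasses.asdict(data_cfg))  # serialize alongside experiment configs
restored = chug.DataCfg(**json.loads(cfg_json))      # rebuild the dataclass
```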

Bypassing the highest level, one can also call the build_pipeline_* methods in task_pipeline and then call create_loader_wds with a full array of args for wds-only use cases.

If one doesn't want to use chug loaders and pipelines at all, the image, text, and wds (especially decoder) functionality may be useful in other projects.

Library modules (highest to lowest level)

The dependencies of modules within the library are intended to follow the hierarchy below; e.g., doc depends on wds, but wds should never depend on doc.

app
 |
loader (chug/loader.py)
 |
task_pipeline
 |
doc
 |
wds, hfds, image, text
 |
common

Submodules

common

Configs, structures (dataclasses) for general use across the library

wds

Webdataset (wds for short) specific code. Extensions and alterations of webdataset functionality to fit the covered use cases and improve robustness.

All data pipelines in chug currently leverage wds pipelines, even when not using wds datasets.

Document-oriented decoding (the PDF decoder) lives in chug/wds/decode.py; it can be used with any webdataset pipeline as a decoder, e.g. wds.decode(chug.wds.DecodeDoc('pill'), 'pill').
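
For instance, a minimal sketch of plugging this decoder into an ordinary webdataset pipeline (the shard URL below is a placeholder, not a real dataset):

```python
# A minimal sketch, assuming a recent webdataset version: plug chug's
# PDF-aware decoder into a plain wds pipeline. The shard URL is a
# placeholder, and 'pill' is the same image handler used above.
import webdataset as wds

import chug

pipeline = wds.DataPipeline(
    wds.SimpleShardList('pipe:curl -s -f -L https://example.com/shards/train-{0000..0009}.tar'),
    wds.tarfile_to_samples(),
    wds.decode(chug.wds.DecodeDoc('pill'), 'pill'),  # renders .pdf entries to page images
)
sample = next(iter(pipeline))  # sample is a dict keyed by file extension
```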

hfds

Hugging Face datasets support. A minimal wrapper that allows datasets to be used with chug processing pipelines.

The processing pipelines remain webdataset based when using datasets; they are invoked by a custom collate class.
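
The rough shape of that pattern, as a generic sketch (PipelineCollate is a hypothetical name for illustration, not chug's actual class):

```python
# A generic sketch of the pattern described above; PipelineCollate is a
# hypothetical name, not chug's actual class. A collate callable applies
# per-sample decode/preprocess functions to a batch of datasets rows,
# then stacks the results with torch's default collate.
from torch.utils.data import default_collate


class PipelineCollate:
    def __init__(self, process_fn):
        self.process_fn = process_fn  # e.g. decode + image/text preprocessing

    def __call__(self, rows):
        return default_collate([self.process_fn(row) for row in rows])
```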

image

Image processing: torchvision- and albumentations-based transform-building code. A mix of generic image (ImageNet, SimCLR) transforms and document-specific transforms, including an implementation of the albumentations-based nougat transforms.
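
For example, a sketch mirroring the usage examples below (using is_training=False for the eval path is an assumption based on the flag's name):

```python
# A minimal sketch mirroring the usage examples below: building train and
# eval image preprocessors from one config. is_training=False for the eval
# path is an assumption based on the flag's name.
import chug

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_basic')
train_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
eval_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=False)
```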

text

Text processing, tokenization code.

doc

Document processing code. Currently focused on processors that apply image/PDF decoders and process document OCR or VQA annotations.

task_pipeline

Task-specific pipelines, where dataset formats meet modelling needs.

Inputs to task pipelines are sample dictionaries based on the dataset form; they are decoded and then processed into outputs that match model input requirements.

Task-specific pipelines that handle the data <--> model input interface are inserted into an encompassing data pipeline that handles shard lists, shuffling, wrapping, distributed workers, splitting, batching, etc.

chug.loader

This lone top-level file includes the main factory methods for creating loaders w/ associated pipelines from config dataclasses.

app

Most applications using chug will exist outside of the lib, in training libraries, etc. Some built-in utility / exploration apps will be included here.

Concepts

WIP

Datasets

Datasets that work well with this library can be found on the Hugging Face Hub under the pixparse organization (https://huggingface.co/pixparse).

We'll add links to other noteworthy datasets that can be used as we become aware of them.

Usage / Examples

Document Reading, Training w/ IDL

```python
import chug

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_idl>',
    task_start_token='<s_idl>',  # NOTE needs to be added to tokenizer
)
task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
    error_handler='dump_and_reraise',
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/idl-wds/resolve/main/idl-train-0{0000..2999}.tar',
    batch_size=8,
    num_samples=3144726,
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)
```

Document Reading, Exploring IDL

```python
import chug

task_cfg = chug.DataTaskDocReadCfg(page_sampling='all')
data_cfg = chug.DataCfg(
    source='pixparse/idl-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb)
sample = next(ii)
```

Document Reading, Training with PDFA

```python
import chug

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_nougat')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_pdfa>',
    task_start_token='<s_pdfa>',  # NOTE needs to be added to tokenizer
)
task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/pdfa-english-train/resolve/main/pdfa-eng-train-{000000..005000}.tar',
    batch_size=8,
    num_samples=1000000,  # FIXME replace with actual
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)
```

Document Reading, Exploring PDFA

```python
import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb)
sample = next(ii)
```

Image + Text

Training

```python
import chug
import transformers
from functools import partial

img_cfg = chug.ImageInputCfg(size=(512, 512), transform_type='image_timm')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
tokenizer = transformers.AutoTokenizer.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
txt_fn = partial(chug.tokenize, max_length=1000, tokenizer=tokenizer)
task_cfg = chug.DataTaskImageTextCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/cc12m-wds/resolve/main/cc12m-train-{0000..2175}.tar',
    batch_size=8,
    num_samples=10968539,
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)
```

Document VQA

Training, Fine-tuning

```python
import chug
from chug.task_pipeline import create_task_pipeline

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_basic')
img_fn = chug.create_image_preprocessor(img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base-finetuned-docvqa',
    prompt_end_token='<s_answer>',
    task_start_token='<s_docvqa>',
)
task_cfg = chug.DataTaskDocVqaCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/docvqa-wds/resolve/main/docvqa-train-{000..383}.tar',
    batch_size=8,
    format='wds',
    num_samples=39463,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb)
sample = next(ii)
```

Exploration

```python
import chug
from chug.task_pipeline import create_task_pipeline

task_cfg = chug.DataTaskDocVqaCfg(
    question_prefix='Question: ',
    question_suffix='',
    answer_prefix='Answer: ',
    answer_suffix='',
)
data_cfg = chug.DataCfg(
    source='pixparse/docvqa-single-page-questions',
    split='validation',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(data_cfg, task_cfg)
ii = iter(lb)
sample = next(ii)
```

Acknowledgement

chug evolved from the webdataset data pipeline used successfully in the OpenCLIP project. Thanks to all the contributors to that project. Future work will likely involve closing the loop and leveraging chug in OpenCLIP for increased capability.

The image/document augmentations in chug draw on a number of external influences. Our document-oriented doc_better torchvision augmentations are influenced by nougat, and doc_nougat is a direct adaptation of the albumentations + cv2 document pipeline in nougat. Several image augmentations leverage existing work in the timm library.

Also, big thanks to the maintainers of webdataset and Hugging Face datasets.
