Commit 0999db2

Merge pull request #3 from huggingface/mega_refactor

Bam!

2 parents cfb1688 + 5e60168, commit 0999db2

File tree: 51 files changed, +4458 −763 lines changed


README.md

Lines changed: 228 additions & 0 deletions

# Chugging Data

A library to help with efficient training on multi-modal data. Initially focused on image & document + text tasks.

`chug` currently leverages `webdataset` and Hugging Face `datasets`. `webdataset` tar files and dataset pipelines are preferred for scalable pretraining. For ease of use, Hugging Face `datasets` are also supported and work well for exploration, validation, and fine-tuning use cases.
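Both backends are driven by the same `DataCfg`; the full examples below build complete pipelines, but the core difference is just the `source` and `format` fields. A minimal sketch, condensed from those examples (same datasets, same field values):

```python
import chug

# webdataset tar shards, streamed via curl (preferred for large-scale pretraining)
wds_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/idl-train-0{0000..1000}.tar',
    split='train',
    batch_size=8,
    num_samples=1000000,
    format='wds',
)

# the same dataset consumed through Hugging Face `datasets` (handy for exploration / validation)
hfids_cfg = chug.DataCfg(
    source='pixparse/IDL-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
```

Either config is then paired with a task config via `chug.create_loader(data_cfg, task_cfg)`, as the examples below show.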
## TODOs

### Nearish
* Cleanup and refinement; the codebase will change.
* Documentation & unit tests.
* Support reading of info .json/.yaml files for automatic shard info resolution for webdatasets (as in timm); see the hypothetical sketch after this list.
* Support unified preprocessor functions for combined image + text tokenization (image + text token interleaving, etc.).
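On the shard-info item above: the idea is that a small metadata file shipped alongside the tar shards would let `chug` resolve shard lists and sample counts automatically, rather than relying on hard-coded `num_samples` values (see the FIXME notes in the examples below). No format is defined yet; a purely hypothetical illustration of the kind of information such a file would carry, expressed here as a Python dict:

```python
# Hypothetical shard-info contents (field names are illustrative only,
# not an existing chug or timm schema).
shard_info = {
    "name": "idl-train",
    "num_shards": 1001,
    "num_samples": 1_000_000,  # would replace the FIXME'd num_samples in DataCfg
    "shards": [
        {"name": "idl-train-00000.tar", "num_samples": 1000},
        # ... one entry per shard
    ],
}
```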
### Longish
* Increase the range of task pipelines to cover other tasks and modelling needs.
* Support additional modalities & targets (video, audio, detection/dense pixel targets, image/video/audio targets).
* Explore alternatives to .tar shards (array_record, arrow, etc.).

## Usage / Examples

### Document Reading, Training w/ IDL

```python
import chug

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_better')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_idl>',
    task_start_token='<s_idl>',  # NOTE: needs to be added to the tokenizer
)

task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
    error_handler='dump_and_reraise',
)
task_pipe = chug.create_task_pipeline(task_cfg)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/IDL-wds/resolve/main/idl-train-0{0000..1000}.tar',  # FIXME range
    split='train',
    batch_size=8,
    num_samples=1000000,  # FIXME get actual value
    num_workers=0,
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb.loader)
sample = next(ii)
```
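The returned `lb` bundles the underlying loader; `lb.loader` is a plain iterable of batches, so it can be fed straight into a training loop. A minimal sketch of consuming it (the exact batch structure depends on the task config, so it is left opaque here):

```python
# Iterate a few batches from the loader built above; batches arrive already
# processed by the task pipeline (img_fn / txt_fn), so nothing is unpacked here.
for step, batch in enumerate(lb.loader):
    if step >= 10:
        break
```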
### Document Reading, Exploring IDL

```python
import chug

task_cfg = chug.DataTaskDocReadCfg(page_sampling='all')
task_pipe = chug.create_task_pipeline(task_cfg)

data_cfg = chug.DataCfg(
    source='pixparse/IDL-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb.loader)
sample = next(ii)
```

### Document Reading, Training with PDFA

```python
import chug

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_nougat')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base',
    prompt_end_token='<s_pdfa>',
    task_start_token='<s_pdfa>',  # NOTE: needs to be added to the tokenizer
)

task_cfg = chug.DataTaskDocReadCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
    page_sampling='random',
)
task_pipe = chug.create_task_pipeline(task_cfg)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/pdfa-english-train/resolve/main/pdfa-eng-train-{000000..005000}.tar',
    split='train',
    batch_size=8,
    num_samples=1000000,  # FIXME approx
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb.loader)
sample = next(ii)
```

### Document Reading, Exploring PDFA

```python
import chug

task_cfg = chug.DataTaskDocReadCfg(
    page_sampling='all',
)
task_pipe = chug.create_task_pipeline(task_cfg)
data_cfg = chug.DataCfg(
    source='pixparse/pdfa-eng-wds',
    split='train',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb.loader)
sample = next(ii)
```

### Image + Text

#### Training

```python
import chug
import transformers
from functools import partial

img_cfg = chug.ImageInputCfg(size=(512, 512), transform_type='image_timm')
img_fn = chug.create_image_preprocessor(input_cfg=img_cfg, is_training=True)
tokenizer = transformers.AutoTokenizer.from_pretrained('laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
txt_fn = partial(chug.tokenize, max_length=1000, tokenizer=tokenizer)
task_cfg = chug.DataTaskImageTextCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
)
task_pipe = chug.create_task_pipeline(task_cfg)
data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/cc12m-wds/resolve/main/cc12m-train-{0000..2175}.tar',
    split='train',
    batch_size=8,
    num_samples=10000000,
    format='wds',
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb.loader)
sample = next(ii)
```

### Document VQA

#### Training, Fine-tuning

```python
import chug
from chug.task_pipeline import create_task_pipeline

img_cfg = chug.ImageInputCfg(size=(1024, 768), transform_type='doc_basic')
img_fn = chug.create_image_preprocessor(img_cfg, is_training=True)
txt_fn = chug.create_text_preprocessor(
    'naver-clova-ix/donut-base-finetuned-docvqa',
    prompt_end_token='<s_answer>',
    task_start_token='<s_docvqa>',
)

task_cfg = chug.DataTaskDocVqaCfg(
    image_process_fn=img_fn,
    text_process_fn=txt_fn,
)
task_pipe = create_task_pipeline(task_cfg)

data_cfg = chug.DataCfg(
    source='pipe:curl -s -f -L https://huggingface.co/datasets/pixparse/docvqa-wds/resolve/main/docvqa-train-{000..383}.tar',
    split='train',
    batch_size=8,
    format='wds',
    num_samples=39463,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
    is_training=True,
)
ii = iter(lb.loader)
sample = next(ii)
```

#### Exploration

```python
import chug
from chug.task_pipeline import create_task_pipeline

task_cfg = chug.DataTaskDocVqaCfg(
    question_prefix='Question:',
    question_suffix='',
    answer_prefix='Answer:',
    answer_suffix='',
)
task_pipe = create_task_pipeline(task_cfg)
data_cfg = chug.DataCfg(
    source='pixparse/docvqa-single-page-questions',
    split='validation',
    batch_size=None,
    format='hfids',
    num_workers=0,
)
lb = chug.create_loader(
    data_cfg,
    task_cfg,
)
ii = iter(lb.loader)
```

pyproject.toml

Lines changed: 18 additions & 6 deletions

```diff
@@ -9,29 +9,41 @@ authors = [
 ]
 description = ""
 readme = "README.md"
-requires-python = ">=3.7"
-keywords = ["webdataset", "dataset", "sharded", "cluster", "scale"]
+requires-python = ">=3.8"
+keywords = ["webdataset", "datasets", "sharded", "cluster", "scale", "documents"]
 license = {text = "Apache-2.0"}
 classifiers = [
-    'Development Status :: 4 - Beta',
+    'Development Status :: 3 - Alpha',
     'Intended Audience :: Education',
     'Intended Audience :: Science/Research',
     'License :: OSI Approved :: Apache Software License',
-    'Programming Language :: Python :: 3.7',
     'Programming Language :: Python :: 3.8',
     'Programming Language :: Python :: 3.9',
     'Programming Language :: Python :: 3.10',
     'Programming Language :: Python :: 3.11',
+    'Programming Language :: Python :: 3.12',
     'Topic :: Scientific/Engineering',
     'Topic :: Scientific/Engineering :: Artificial Intelligence',
     'Topic :: Software Development',
     'Topic :: Software Development :: Libraries',
     'Topic :: Software Development :: Libraries :: Python Modules',
 ]
 dependencies = [
-    "webdataset",
-    'importlib-metadata; python_version<"3.8"',
+    "webdataset",
+    "timm",
+    "torch",
+    "simple_parsing",
+    "pypdfium2",
+    'importlib-metadata; python_version<"3.8"',
 ]
+
+[project.optional-dependencies]
+# albumentations (nougat augs)
+alb = [
+    "albumentations",
+    'cv2',
+]
+
 dynamic = ["version"]

 [tool.pdm.version]
```

requirements.txt

Lines changed: 5 additions & 1 deletion

```diff
@@ -1,2 +1,6 @@
 torch
-webdataset
+timm
+webdataset
+datasets
+pypdfium2
+simple_parsing
```

src/chug/__init__.py

Lines changed: 44 additions & 1 deletion

```diff
@@ -1,3 +1,46 @@
-from .webdataset import create_wds_loader, create_doc_anno_pipe, create_image_text_pipe
+from .common import (
+    ImageInputCfg,
+    ImageAugCfg,
+    LoaderBundle,
+    ImageFeatureInfo,
+    FeatureInfo,
+    ShardSpec,
+    SourceSpec,
+    DataArg,
+    DataCfg,
+    DistributedCfg,
+)
+from .hfds import create_loader_hf
+from .image import (
+    build_image_transforms,
+    build_transforms_image_basic,
+    build_transforms_image_timm,
+    build_transforms_doc_basic,
+    build_transforms_doc_better,
+    build_transforms_doc_nougat,
+    create_image_preprocessor,
+)
+from .loader import create_loader, create_loader_from_config_hf, create_loader_from_config_wds
+from .task_pipeline import (
+    create_task_pipeline,
+    build_task_pipeline_doc_read,
+    build_task_pipeline_doc_vqa,
+    build_task_pipeline_gtparse,
+    build_task_pipeline_image_text,
+    build_task_pipeline_manual,
+    DataTaskDocReadCfg,
+    DataTaskDocVqaCfg,
+    DataTaskImageTextCfg,
+    DataTaskManualCfg,
+)
+from .text import tokenize, text_input_to_target, prepare_text_input, create_text_preprocessor
 from .version import __version__
+from .wds import (
+    create_loader_wds,
+    build_data_pipeline,
+    decode_image_pages,
+    decode_pdf_pages,
+    create_image_decoder,
+    DecodeDoc,
+)

```
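For reference, the main entry points used throughout the README examples are re-exported at the package root by this refactor, so the attribute-style access (`chug.DataCfg`, `chug.create_loader`, ...) and direct imports are equivalent. A small illustration using only names from the export list above:

```python
# All of these names are re-exported in src/chug/__init__.py above.
from chug import (
    DataCfg,
    DataTaskDocReadCfg,
    create_image_preprocessor,
    create_text_preprocessor,
    create_task_pipeline,
    create_loader,
)
```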

src/chug/app/test.py

Lines changed: 37 additions & 0 deletions

```python
import logging
import os
from dataclasses import dataclass, replace
from datetime import datetime
from pprint import pprint
from typing import Dict, Optional, Union

import simple_parsing

from chug.common import ImageInputCfg, ImageAugCfg, DataArg
from chug.wds import create_loader_wds


@dataclass
class TestArgs:
    data: DataArg
    # FIXME need TaskArg form to define subset of task cfg options from command line
    input: ImageInputCfg
    aug: ImageAugCfg


def main():
    args = simple_parsing.parse(
        TestArgs,
        add_option_string_dash_variants=simple_parsing.DashVariant.DASH,
        argument_generation_mode=simple_parsing.ArgumentGenerationMode.BOTH,
        add_config_path_arg=True,
    )

    pprint(args)

    loader = create_loader_wds(...)

    # FIXME WIP app to demo iteration / analysis for supported datasets


if __name__ == '__main__':
    main()
```

src/chug/common/__init__.py

Lines changed: 9 additions & 1 deletion

```diff
@@ -1 +1,9 @@
-from .types import SharedCount, LoaderBundle
+from .collate import collate
+from .config import ImageInputCfg, ImageAugCfg, PreprocessCfg, image_mode_to_chs
+from .config import DataArg, DataCfg, DistributedCfg, source_to_shard_spec
+from .types import SourceSpec, ShardSpec, SharedCount, LoaderBundle, FeatureInfo, ImageFeatureInfo
+from .urls import expand_urls
+
+# FIXME uncertain types
+from .types import SplitInfo, ShardSpec
+from .task_config import DataTaskCfg
```
