GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14804)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Self-supervised multi-modal document pre-training for document knowledge learning has proven superior on various downstream tasks. However, because document languages and structures are diverse, there is still room to better model varied document layouts while efficiently utilizing pre-trained language models. To this end, this paper proposes a Graph-based Multi-level Layout Language-independent Model (GraphMLLM), which uses a dual-stream structure to explore textual and layout information both separately and cooperatively. Specifically, GraphMLLM consists of a text stream, which uses an off-the-shelf pre-trained language model to explore textual semantics, and a layout stream, which uses a multi-level graph neural network (GNN) to model hierarchical page layouts. Through the cooperation of the two streams, GraphMLLM can model multi-level page layouts more comprehensively and improve the performance of language-independent document pre-trained models. Experimental results show that, compared with previous state-of-the-art methods, GraphMLLM achieves higher performance on downstream visual information extraction (VIE) tasks after pre-training on fewer documents. Code and model will be available at https://github.com/HSDai/GraphMLLM.
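To make the dual-stream idea concrete, the following is a minimal, dependency-free sketch of a two-level layout graph whose output is combined with text features. It is not the authors' implementation: the function names, the word/segment hierarchy, the k-nearest-neighbour edges, mean aggregation, and fusion by concatenation are all assumptions chosen for illustration, and `text_feats` merely stands in for embeddings that a pre-trained language model would produce.

```python
# Illustrative sketch only. The two-level hierarchy, k-NN edges, mean
# aggregation, and concatenation fusion are assumptions, not GraphMLLM itself.
from math import hypot


def box_center(box):
    """Return the (x, y) center of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def knn_edges(boxes, k):
    """Connect each node to its k nearest neighbours by center distance."""
    centers = [box_center(b) for b in boxes]
    neighbours = []
    for i, (xi, yi) in enumerate(centers):
        dists = sorted(
            (hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def message_pass(features, neighbours):
    """One round of mean aggregation over each node and its neighbours."""
    return [
        mean_vectors([features[i]] + [features[j] for j in neighbours[i]])
        for i in range(len(features))
    ]


def layout_stream(word_boxes, word_to_segment, k=2):
    """Word-level pass, pool to segments, segment-level pass, broadcast back.

    Assumes segment ids are 0..n-1 and every segment has at least one word.
    """
    # Level 1: word nodes carry their box coordinates as layout features.
    word_feats = message_pass(
        [[float(v) for v in b] for b in word_boxes],
        knn_edges(word_boxes, k),
    )
    # Level 2: each segment node is the mean of its member words.
    n_seg = max(word_to_segment) + 1
    seg_feats = [
        mean_vectors([f for f, s in zip(word_feats, word_to_segment) if s == g])
        for g in range(n_seg)
    ]
    seg_boxes = [tuple(f) for f in seg_feats]
    seg_feats = message_pass(seg_feats, knn_edges(seg_boxes, min(k, n_seg - 1)))
    # Broadcast segment context back to words and add it to the word features.
    return [
        [w + s for w, s in zip(wf, seg_feats[g])]
        for wf, g in zip(word_feats, word_to_segment)
    ]


def fuse(text_feats, layout_feats):
    """Combine the two streams per token by concatenating their vectors."""
    return [t + l for t, l in zip(text_feats, layout_feats)]
```

For example, calling `layout_stream(word_boxes, word_to_segment, k=1)` on four word boxes grouped into two segments yields one 4-dimensional layout vector per word, and `fuse` appends it to that word's text vector. A real model would replace the mean aggregation with learned GNN layers and let the streams interact during pre-training rather than only at the end.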

Notes

1. http://www.resensetech.com

Acknowledgements

This work has been supported by the National Key Research and Development Program Grant 2020AAA0109700, and the National Natural Science Foundation of China (NSFC) Grant U23B2029.

Author information

Authors and Affiliations

School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China
He-Sen Dai & Cheng-Lin Liu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation of Chinese Academy of Sciences, Beijing, 100190, China
He-Sen Dai, Xiao-Hui Li, Fei Yin & Cheng-Lin Liu
T Lab, Tencent Map, Tencent Technology (Beijing) Co., Ltd., Beijing, 100193, China
Xudong Yan & Shuqi Mei

Authors

He-Sen Dai
Xiao-Hui Li
Fei Yin
Xudong Yan
Shuqi Mei
Cheng-Lin Liu

Corresponding author

Correspondence to Xiao-Hui Li.

Editor information

Editors and Affiliations

Luleå Tekniska Universitet, Luleå, Sweden
Elisa H. Barney Smith
Luleå Tekniska Universitet, Luleå, Sweden
Marcus Liwicki
Tsinghua University, Beijing, China
Liangrui Peng

Cite this paper

Dai, HS., Li, XH., Yin, F., Yan, X., Mei, S., Liu, CL. (2024). GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14804. Springer, Cham. https://doi.org/10.1007/978-3-031-70533-5_14

DOI: https://doi.org/10.1007/978-3-031-70533-5_14
Published: 08 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70532-8
Online ISBN: 978-3-031-70533-5
eBook Packages: Computer Science, Computer Science (R0)
