Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14804)
Abstract
Self-supervised multi-modal document pre-training for document knowledge learning has shown superiority in various downstream tasks. However, because document languages and structures are so diverse, there is still room to better model varied document layouts while efficiently reusing pre-trained language models. To this end, this paper proposes a Graph-based Multi-level Layout Language-independent Model (GraphMLLM), which uses a dual-stream structure to explore textual and layout information separately and cooperatively. Specifically, GraphMLLM consists of a text stream, which uses an off-the-shelf pre-trained language model to explore textual semantics, and a layout stream, which uses a multi-level graph neural network (GNN) to model hierarchical page layouts. Through the cooperation of the text stream and the layout stream, GraphMLLM can model multi-level page layouts more comprehensively and improve the performance of language-independent document pre-trained models. Experimental results show that, compared with previous state-of-the-art methods, GraphMLLM yields higher performance on downstream visual information extraction (VIE) tasks after pre-training on fewer documents. Code and model will be available at https://github.com/HSDai/GraphMLLM.
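The abstract does not say how the layout graphs are built or how the two streams are fused, so the following is only a rough illustration of the dual-stream idea, not the authors' implementation. It assumes a two-level hierarchy (words and text segments), k-nearest-neighbour edges between box centers, a single round of mean aggregation standing in for a GNN layer, and fusion by concatenation. All function names (`knn_edges`, `mean_aggregate`, `layout_stream`, `fuse`) and the six-value box feature are hypothetical.

```python
from math import hypot

# Illustrative sketch only: every design choice below is an assumption,
# not a detail taken from the paper.

def center(box):
    """Center point of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def knn_edges(boxes, k):
    """Link each box to its k nearest boxes by center distance (assumed rule)."""
    centers = [center(b) for b in boxes]
    edges = {}
    for i, (xi, yi) in enumerate(centers):
        dists = sorted(
            (hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        edges[i] = [j for _, j in dists[:k]]
    return edges

def box_features(box):
    """Hypothetical layout feature: corner coordinates plus width and height."""
    x0, y0, x1, y1 = box
    return [x0, y0, x1, y1, x1 - x0, y1 - y0]

def mean_aggregate(features, edges):
    """One round of mean message passing: average a node with its neighbours."""
    out = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in edges[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out

def union_box(boxes):
    """Smallest box covering all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )

def layout_stream(word_boxes, segment_of_word, k=2):
    """Encode layout at two levels and return one vector per word.

    segment_of_word[i] is the segment id (0..n-1) of word i; every id in
    that range is assumed to own at least one word.
    """
    # Level 1: graph over word boxes.
    word_feats = mean_aggregate(
        [box_features(b) for b in word_boxes], knn_edges(word_boxes, k)
    )
    # Level 2: graph over segment boxes (union of each segment's word boxes).
    n_seg = max(segment_of_word) + 1
    seg_boxes = [
        union_box([b for b, s in zip(word_boxes, segment_of_word) if s == t])
        for t in range(n_seg)
    ]
    seg_feats = mean_aggregate(
        [box_features(b) for b in seg_boxes],
        knn_edges(seg_boxes, min(k, n_seg - 1)),
    )
    # Give every word its own feature plus the feature of its segment.
    return [wf + seg_feats[s] for wf, s in zip(word_feats, segment_of_word)]

def fuse(text_embeddings, layout_embeddings):
    """Combine the two streams per token by concatenation (an assumption)."""
    return [t + l for t, l in zip(text_embeddings, layout_embeddings)]
```

In this sketch the text stream is left outside the code: `fuse` would receive per-token vectors from whatever pre-trained language model is plugged in. That mirrors the language-independent design described in the abstract, where the layout stream never needs to see the text itself.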
Acknowledgements
This work has been supported by the National Key Research and Development Program Grant 2020AAA0109700, and the National Natural Science Foundation of China (NSFC) Grant U23B2029.
Author information
Authors and Affiliations
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China
He-Sen Dai & Cheng-Lin Liu
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation of Chinese Academy of Sciences, Beijing, 100190, China
He-Sen Dai, Xiao-Hui Li, Fei Yin & Cheng-Lin Liu
T Lab, Tencent Map, Tencent Technology (Beijing) Co., Ltd., Beijing, 100193, China
Xudong Yan & Shuqi Mei
Corresponding author
Correspondence to Xiao-Hui Li.
Editor information
Editors and Affiliations
Luleå Tekniska Universitet, Luleå, Sweden
Elisa H. Barney Smith
Luleå Tekniska Universitet, Luleå, Sweden
Marcus Liwicki
Tsinghua University, Beijing, China
Liangrui Peng
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dai, HS., Li, XH., Yin, F., Yan, X., Mei, S., Liu, CL. (2024). GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14804. Springer, Cham. https://doi.org/10.1007/978-3-031-70533-5_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70532-8
Online ISBN: 978-3-031-70533-5
eBook Packages: Computer Science, Computer Science (R0)