GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14804)


Abstract

Self-supervised multi-modal document pre-training for document knowledge learning has shown superiority in various downstream tasks. However, due to the diversity of document languages and structures, there is still room to better model various document layouts while efficiently utilizing pre-trained language models. To this end, this paper proposes a Graph-based Multi-level Layout Language-independent Model (GraphMLLM), which uses a dual-stream structure to explore textual and layout information separately and cooperatively. Specifically, GraphMLLM consists of a text stream, which uses an off-the-shelf pre-trained language model to explore textual semantics, and a layout stream, which uses a multi-level graph neural network (GNN) to model hierarchical page layouts. Through the cooperation of the text stream and the layout stream, GraphMLLM can model multi-level page layouts more comprehensively and improve the performance of language-independent document pre-trained models. Experimental results show that, compared with previous state-of-the-art methods, GraphMLLM yields higher performance on downstream visual information extraction (VIE) tasks after pre-training on fewer documents. Code and model will be available at https://github.com/HSDai/GraphMLLM.
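
The dual-stream idea described in the abstract can be illustrated with a small sketch. The abstract does not specify how the layout graph is built, which levels it uses, or how the two streams are combined, so everything below is an assumption made for illustration: a k-nearest-neighbour graph over bounding-box centers, mean aggregation as the message-passing step, two levels (words and segments), and concatenation as the fusion step. The function names (`knn_edges`, `mean_aggregate`, `layout_stream`, `fuse`) are hypothetical and are not taken from the GraphMLLM code.

```python
from math import hypot


def center(box):
    """Return the center (x, y) of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def knn_edges(boxes, k):
    """Assumed graph construction: link each box to its k nearest boxes."""
    centers = [center(b) for b in boxes]
    edges = {}
    for i, (xi, yi) in enumerate(centers):
        dists = sorted(
            (hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        edges[i] = [j for _, j in dists[:k]]
    return edges


def mean_aggregate(features, edges):
    """One message-passing round: average each node with its neighbours."""
    out = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in edges[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def layout_stream(word_boxes, word_to_segment, k=2):
    """Two-level layout features (word level, then segment level).

    Assumes every segment index in range has at least one word.
    """
    # Level 1: message passing on the word-level graph.
    word_feats = [[float(v) for v in box] for box in word_boxes]
    word_feats = mean_aggregate(word_feats, knn_edges(word_boxes, k))

    # Level 2: each segment node pools the features of its words.
    n_segments = max(word_to_segment) + 1
    seg_feats = []
    for s in range(n_segments):
        members = [f for f, seg in zip(word_feats, word_to_segment) if seg == s]
        seg_feats.append([sum(col) / len(members) for col in zip(*members)])

    # Give each word its own features plus its segment's context.
    return [wf + seg_feats[seg] for wf, seg in zip(word_feats, word_to_segment)]


def fuse(text_embeddings, layout_embeddings):
    """Assumed fusion: concatenate text and layout vectors per token."""
    return [t + l for t, l in zip(text_embeddings, layout_embeddings)]


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 50, 10, 60), (12, 50, 22, 60)]
    segments = [0, 0, 1, 1]  # two words per text segment
    # Stand-in for vectors from a pre-trained language model.
    text = [[0.1, 0.2, 0.3] for _ in boxes]
    fused = fuse(text, layout_stream(boxes, segments, k=1))
    print(len(fused), len(fused[0]))
```

Because the layout stream in this sketch only sees box coordinates, the text embeddings can come from any language model, which is the sense of "language-independent" the abstract points to. A trained model would use learned GNN layers rather than plain averaging.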



Acknowledgements

This work has been supported by the National Key Research and Development Program Grant 2020AAA0109700, and the National Natural Science Foundation of China (NSFC) Grant U23B2029.

Author information

Authors and Affiliations

  1. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, 100049, China

    He-Sen Dai & Cheng-Lin Liu

  2. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation of Chinese Academy of Sciences, Beijing, 100190, China

    He-Sen Dai, Xiao-Hui Li, Fei Yin & Cheng-Lin Liu

  3. T Lab, Tencent Map, Tencent Technology (Beijing) Co., Ltd., Beijing, 100193, China

    Xudong Yan & Shuqi Mei


Corresponding author

Correspondence to Xiao-Hui Li.

Editor information

Editors and Affiliations

  1. Luleå Tekniska Universitet, Luleå, Sweden

    Elisa H. Barney Smith

  2. Luleå Tekniska Universitet, Luleå, Sweden

    Marcus Liwicki

  3. Tsinghua University, Beijing, China

    Liangrui Peng


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dai, HS., Li, XH., Yin, F., Yan, X., Mei, S., Liu, CL. (2024). GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding. In: Barney Smith, E.H., Liwicki, M., Peng, L. (eds) Document Analysis and Recognition - ICDAR 2024. ICDAR 2024. Lecture Notes in Computer Science, vol 14804. Springer, Cham. https://doi.org/10.1007/978-3-031-70533-5_14
