Enhancing keyphrase extraction from long scientific documents using graph embeddings

  • Published in: Applied Intelligence

Abstract

This study explores the integration of graph neural network (GNN) representations with pre-trained language models (PLMs) to enhance keyphrase extraction (KPE) from lengthy documents. We demonstrate that incorporating graph embeddings into PLMs yields richer semantic representations, especially for long texts. Our approach constructs a co-occurrence graph of the document, which we then embed using a graph convolutional network (GCN) trained for edge prediction. This process captures non-sequential relationships and long-distance dependencies, both of which are often crucial in lengthy documents. We introduce a novel graph-enhanced sequence tagging architecture that combines PLM-based contextual embeddings with GNN-derived representations. In evaluations on benchmark datasets, our method outperforms state-of-the-art models, showing notable improvements in F1 scores. Beyond performance on standard benchmarks, this approach also holds promise in domains such as legal, medical, and scientific document processing, where efficient handling of long texts is vital. Our findings underscore the potential for GNNs to complement PLMs, helping address both technical and real-world challenges in KPE for long documents.
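The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size, the one-hot node features, the made-up "contextual" vectors, and the function names `cooccurrence_graph`, `gcn_propagate`, and `fuse` are all assumptions chosen for illustration. The sketch builds a word co-occurrence graph, applies one parameter-free GCN-style propagation step (symmetric normalization with self-loops, with no learned weights and no edge-prediction training), and concatenates each token's contextual vector with its word's graph vector, as a sequence tagger's input might be formed.

```python
from collections import defaultdict
from math import sqrt


def cooccurrence_graph(tokens, window=2):
    """Undirected co-occurrence graph: words are nodes, and an edge links two
    distinct words that appear within `window` positions of each other."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        graph[word]  # touch the key so isolated words still become nodes
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def gcn_propagate(graph, features):
    """One propagation step without learned weights:
    h_v = sum over u in N(v) and v itself of x_u / sqrt(d_v * d_u),
    where d counts neighbours plus one self-loop."""
    degree = {v: len(nbrs) + 1 for v, nbrs in graph.items()}
    dim = len(next(iter(features.values())))
    out = {}
    for v, nbrs in graph.items():
        acc = [0.0] * dim
        for u in list(nbrs) + [v]:
            scale = 1.0 / sqrt(degree[v] * degree[u])
            for k in range(dim):
                acc[k] += scale * features[u][k]
        out[v] = acc
    return out


def fuse(tokens, contextual, node_embeddings):
    """Concatenate each token's contextual vector with its word's graph vector."""
    return [ctx + node_embeddings[tok] for tok, ctx in zip(tokens, contextual)]


tokens = "graph embeddings enrich language models for long documents".split()
graph = cooccurrence_graph(tokens, window=2)

# Toy one-hot node features (assumption); a real system would learn features.
vocab = sorted(graph)
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}
node_vecs = gcn_propagate(graph, one_hot)

# Stand-in for PLM output: one made-up 3-dimensional vector per token.
contextual = [[0.1 * i, 0.2, 0.3] for i in range(len(tokens))]
fused = fuse(tokens, contextual, node_vecs)
print(len(fused), len(fused[0]))
```

Because the graph is built over the whole document, two words that co-occur far from the current token window still share an edge, which is one way non-sequential, long-distance relationships can reach the tagger's input.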




Data Availability and Access

All datasets supporting the conclusions of this article are publicly available and openly published for research purposes, which supports transparency and the reproducibility of the findings. Links to the datasets, including URLs and access methods, are given in Section 4.1 (Datasets) of this paper.



Acknowledgements

We acknowledge the insightful discussions and ideas generated from the preprint version of this work, titled 'Enhancing KPE from Long Scientific Documents using Graph Embeddings', authored by Roberto Martínez-Cruz, Debanjan Mahata, Alvaro J. López-López and José Portela. The preprint was made available on arXiv (https://arxiv.org/abs/2305.09316) in May 2023.

Author information

Author notes
  1. Roberto Martínez-Cruz and Debanjan Mahata contributed equally to this work.

Authors and Affiliations

  1. Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University, Madrid, Spain

    Roberto Martínez-Cruz, Alvaro J. López-López & José Portela

  2. Bloomberg, New York, NY, USA

    Debanjan Mahata


Contributions

Roberto Martínez-Cruz and Debanjan Mahata contributed equally to the conception and design of the study, as well as the development of the methodology and analysis of data. Alvaro J. López-López and José Portela contributed to the collection and assembly of data, as well as providing critical revisions that added important intellectual content. All authors participated in drafting the manuscript and approved the final version to be published.

Corresponding author

Correspondence to Roberto Martínez-Cruz.

Ethics declarations

Competing Interests

The authors declare no competing interests related to this study. All methodologies, analyses, and interpretations of data were conducted independently and without influence from external entities. This research was purely academic and aimed at contributing to the existing body of knowledge on keyphrase extraction from long documents using graph embeddings.

Ethical Approval and Informed Consent

This research did not involve any human participants, data, or tissues, and therefore did not require ethical approval or informed consent. All datasets utilized in this study are publicly available and were openly published for research purposes, adhering to all applicable ethical standards.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
