- Roberto Martínez-Cruz ORCID: orcid.org/0009-0009-4264-1678 1 na1,
- Debanjan Mahata2 na1,
- Alvaro J. López-López1 &
- …
- José Portela1
Abstract
This study explores the integration of graph neural network (GNN) representations with pre-trained language models (PLMs) to enhance keyphrase extraction (KPE) from lengthy documents. We demonstrate that incorporating graph embeddings into PLMs yields richer semantic representations, especially for long texts. Our approach constructs a co-occurrence graph of the document, which we then embed using a graph convolutional network (GCN) trained for edge prediction. This process captures non-sequential relationships and long-distance dependencies, both of which are often crucial in lengthy documents. We introduce a novel graph-enhanced sequence tagging architecture that combines PLM-based contextual embeddings with GNN-derived representations. In evaluations on benchmark datasets, our method outperforms state-of-the-art models, showing notable improvements in F1 scores. Beyond performance on standard benchmarks, this approach also holds promise in domains such as legal, medical, and scientific document processing, where efficient handling of long texts is vital. Our findings underscore the potential for GNNs to complement PLMs, helping address both technical and real-world challenges in KPE for long documents.
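To make the graph side of the pipeline concrete, the following is a minimal sketch of a sliding-window co-occurrence graph followed by one weight-free, GCN-style propagation step. It is not the authors' implementation: the window size, the whitespace tokenization, the one-hot input features, and the function names `build_cooccurrence_graph` and `propagate` are all assumptions made for illustration, and the learned weights, nonlinearity, and edge-prediction training described in the abstract are omitted.

```python
from collections import defaultdict
from math import sqrt


def build_cooccurrence_graph(tokens, window=2):
    """Return {word: {neighbor: count}} for words within `window` positions of each other.

    The window size is an illustrative assumption, not a value from the paper.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        graph[word]  # register the node even if it ends up with no neighbors
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other == word:
                continue  # self-loops are added explicitly during propagation
            graph[word][other] += 1
            graph[other][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


def propagate(graph, features):
    """One weight-free GCN-style step: H' = D^-1/2 (A + I) D^-1/2 H."""
    # Degree of each node in A + I (weighted neighbor counts plus the self-loop).
    degree = {w: 1 + sum(nbrs.values()) for w, nbrs in graph.items()}
    dim = len(next(iter(features.values())))
    out = {}
    for w, nbrs in graph.items():
        acc = [0.0] * dim
        for v, weight in list(nbrs.items()) + [(w, 1)]:
            norm = weight / sqrt(degree[w] * degree[v])
            for k in range(dim):
                acc[k] += norm * features[v][k]
        out[w] = acc
    return out


# Toy document; real inputs would be full-length articles.
tokens = "graph embeddings enrich language models for long documents".split()
graph = build_cooccurrence_graph(tokens, window=2)

# One-hot input features (an assumption; any initial word vectors would do).
vocab = sorted(graph)
features = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}
smoothed = propagate(graph, features)
```

After this step each word's vector mixes in information from the words it co-occurs with, regardless of how far apart those occurrences are in the token sequence. In a full system, such node vectors would be combined with the PLM's contextual token embeddings before the tagging layer; how that combination is done is not specified in the abstract.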
Data Availability and Access
The datasets supporting the conclusions of this article are all publicly available and openly published for research purposes. Links to these datasets, including URLs and access methods, can be found in Section 4.1 (Datasets) of this paper. The availability of these datasets ensures transparency and allows the research findings to be reproduced.
Acknowledgements
We acknowledge the insightful discussions and ideas generated from the preprint version of our own work, titled 'Enhancing KPE from Long Scientific Documents using Graph Embeddings', authored by Roberto Martínez-Cruz, Debanjan Mahata, Alvaro J. López-López and José Portela. This preprint was made available on arXiv (https://arxiv.org/abs/2305.09316) in May 2023.
Author information
Roberto Martínez-Cruz and Debanjan Mahata contributed equally to this work.
Authors and Affiliations
Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University, Madrid, Spain
Roberto Martínez-Cruz, Alvaro J. López-López & José Portela
Bloomberg, New York, NY, USA
Debanjan Mahata
Contributions
Roberto Martínez-Cruz and Debanjan Mahata contributed equally to the conception and design of the study, as well as the development of the methodology and analysis of data. Alvaro J. López-López and José Portela contributed to the collection and assembly of data, as well as providing critical revisions that added important intellectual content. All authors participated in drafting the manuscript and approved the final version to be published.
Corresponding author
Correspondence to Roberto Martínez-Cruz.
Ethics declarations
Competing Interests
The authors declare no competing interests related to this study. All methodologies, analyses, and interpretations of data were conducted independently and without influence from external entities. This research was purely academic and aimed at contributing to the existing body of knowledge on keyphrase extraction from long documents using graph embeddings.
Ethical Approval and Informed Consent
This research did not involve any human participants, data, or tissues, and therefore did not require ethical approval or informed consent. All datasets utilized in this study are publicly available and were openly published for research purposes, adhering to all applicable ethical standards.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Martínez-Cruz, R., Mahata, D., López-López, A.J. et al. Enhancing keyphrase extraction from long scientific documents using graph embeddings. Appl Intell 55, 711 (2025). https://doi.org/10.1007/s10489-025-06579-y