Part of the book series: Communications in Computer and Information Science (CCIS, volume 1905)
Abstract
Being able to extract the main points, key insights, and other important information from scientific papers, referred to here as aspects, might facilitate the process of conducting a scientific literature review. The aim of our research is therefore to create a tool for automatic aspect extraction from Russian-language scientific texts of any domain. In this paper, we present a cross-domain dataset of scientific texts in Russian, annotated with such aspects as Task, Contribution, Method, and Conclusion, as well as a baseline algorithm for aspect extraction based on the multilingual BERT model fine-tuned on our data. We show that aspect representation differs somewhat across domains, but even though our model was trained on a limited number of scientific domains, it is still able to generalize to new domains, as shown by cross-domain experiments. The code and the dataset are available at https://github.com/anna-marshalova/automatic-aspect-extraction-from-scientific-texts.
Notes
- 1. Population/Problem (P), Intervention (I), Comparison (C) and Outcome (O).
- 2. The texts are originally in Russian and were translated into English only to provide examples in the paper.
- 3.
Author information
Authors and Affiliations
Novosibirsk State University, Novosibirsk, Russia
Anna Marshalova & Elena Bruches
A. P. Ershov Institute of Informatics Systems, Novosibirsk, Russia
Elena Bruches & Tatiana Batura
Corresponding author
Correspondence to Anna Marshalova.
Editor information
Editors and Affiliations
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov
Krasovskii Institute of Mathematics and Mechanics of Russian Academy of Sciences, Yekaterinburg, Russia
Michael Khachay
University of Oslo, Oslo, Norway
Andrey Kutuzov
American University of Armenia, Yerevan, Armenia
Habet Madoyan
Artificial Intelligence Research Institute, Moscow, Russia
Ilya Makarov
Universität Hamburg, Hamburg, Germany
Irina Nikishina
Skolkovo Institute of Science and Technology, Moscow, Russia
Alexander Panchenko
Mohamed bin Zayed University of Artificial Intelligence and Technology Innovation Institute, Abu Dhabi, United Arab Emirates
Maxim Panov
Industrial and Systems Engineering, University of Florida, Gainesville, FL, USA
Panos M. Pardalos
National Research University Higher School of Economics, Nizhny Novgorod, Russia
Andrey V. Savchenko
Apptek, Aachen, Nordrhein-Westfalen, Germany
Evgenii Tsymbalov
Kazan Federal University and HSE University, Moscow, Russia
Elena Tutubalina
MTS AI, Moscow, Russia
Sergey Zagoruyko
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Marshalova, A., Bruches, E., Batura, T. (2024). Automatic Aspect Extraction from Scientific Texts. In: Ignatov, D.I., et al. Recent Trends in Analysis of Images, Social Networks and Texts. AIST 2023. Communications in Computer and Information Science, vol 1905. Springer, Cham. https://doi.org/10.1007/978-3-031-67008-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-67007-7
Online ISBN: 978-3-031-67008-4
eBook Packages: Computer Science, Computer Science (R0)