- Alessia Cioffi ORCID: orcid.org/0000-0002-9812-4065 &
- Silvio Peroni ORCID: orcid.org/0000-0003-0530-4305
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13541)
Abstract
Many solutions have been proposed for extracting bibliographic references from PDF papers. Machine learning, rule-based and regular-expression approaches are among the methods most often adopted in tools addressing this task. This work aims to identify and evaluate all and only the tools which, given a full-text paper in PDF format, can recognise, extract and parse bibliographic references. We identified seven tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse. We compared and evaluated them against a corpus of 56 PDF articles published in 27 subject areas. Anystyle obtained the best overall score, followed by Cermine. However, in some subject areas, other tools had better results for specific tasks.
Acknowledgements
The work of Silvio Peroni has been partially funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017452 (OpenAIRE-Nexus).
Author information
Authors and Affiliations
Digital Humanities and Digital Knowledge, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Alessia Cioffi
Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Silvio Peroni
Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy
Silvio Peroni
Corresponding author
Correspondence to Silvio Peroni.
Editor information
Editors and Affiliations
University of Padua, Padua, Italy
Gianmaria Silvello
Universidad Politécnica de Madrid, Madrid, Spain
Oscar Corcho
CNR-ISTI – National Research Council, Pisa, Italy
Paolo Manghi
University of Padua, Padua, Italy
Giorgio Maria Di Nunzio
Linnaeus University, Växjö, Sweden
Koraljka Golub
University of Padua, Padua, Italy
Nicola Ferro
Sapienza University of Rome, Rome, Italy
Antonella Poggi
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Cioffi, A., Peroni, S. (2022). Structured References from PDF Articles: Assessing the Tools for Bibliographic Reference Extraction and Parsing. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_42
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16801-7
Online ISBN: 978-3-031-16802-4
eBook Packages: Computer Science, Computer Science (R0)