Structured References from PDF Articles: Assessing the Tools for Bibliographic Reference Extraction and Parsing

  • Conference paper

Abstract

Many solutions have been proposed for extracting bibliographic references from PDF papers. Machine learning, rule-based and regular-expression approaches are among the methods most often adopted by tools addressing this task. This work aims to identify and evaluate all and only the tools which, given a full-text paper in PDF format, can recognise, extract and parse bibliographic references. We identified seven tools: Anystyle, Cermine, ExCite, Grobid, Pdfssa4met, Scholarcy and Science Parse. We compared and evaluated them against a corpus of 56 PDF articles published in 27 subject areas. Anystyle obtained the best overall score, followed by Cermine. However, in some subject areas, other tools had better results for specific tasks.
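
To make the parsing task concrete, the following is a minimal sketch of the regular-expression approach mentioned above. It is not taken from any of the seven tools evaluated; the pattern, the field names and the function `parse_reference` are hypothetical, and the sketch assumes a single reference string that has already been extracted and that follows the layout "authors: title. source (year)".

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for one reference laid out as
#   "<authors>: <title>. <source> (<year>)"
# Real reference styles vary widely, so this covers only that one layout.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>[^:]+):\s+"   # everything before the first colon
    r"(?P<title>.+?)\.\s+"       # shortest text ending in a full stop
    r"(?P<source>.+?)\s+"        # journal, volume, pages and so on
    r"\((?P<year>\d{4})\)\.?$"   # four-digit year in parentheses
)


def parse_reference(reference: str) -> Optional[Dict[str, str]]:
    """Split one reference string into fields, or return None if it does not fit."""
    match = REFERENCE_PATTERN.match(reference.strip())
    return match.groupdict() if match else None


example = (
    "Khabsa, M., Giles, C.L.: The number of scholarly documents "
    "on the public web. PLoS ONE 9(5), e93949 (2014)"
)
parsed = parse_reference(example)
print(parsed)
```

A single fixed pattern like this fails as soon as the citation style changes, which is one reason such tools also rely on rule sets or machine learning models.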

Acknowledgements

The work of Silvio Peroni has been partially funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 101017452 (OpenAIRE-Nexus).

Author information

Authors and Affiliations

  1. Digital Humanities and Digital Knowledge, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

    Alessia Cioffi

  2. Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

    Silvio Peroni

  3. Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

    Silvio Peroni

Authors
  1. Alessia Cioffi

  2. Silvio Peroni

Corresponding author

Correspondence to Silvio Peroni.

Editor information

Editors and Affiliations

  1. University of Padua, Padua, Italy

    Gianmaria Silvello

  2. Universidad Politécnica de Madrid, Madrid, Spain

    Oscar Corcho

  3. CNR-ISTI – National Research Council, Pisa, Italy

    Paolo Manghi

  4. University of Padua, Padua, Italy

    Giorgio Maria Di Nunzio

  5. Linnaeus University, Växjö, Sweden

    Koraljka Golub

  6. University of Padua, Padua, Italy

    Nicola Ferro

  7. Sapienza University of Rome, Rome, Italy

    Antonella Poggi

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Cioffi, A., Peroni, S. (2022). Structured References from PDF Articles: Assessing the Tools for Bibliographic Reference Extraction and Parsing. In: Silvello, G., et al. Linking Theory and Practice of Digital Libraries. TPDL 2022. Lecture Notes in Computer Science, vol 13541. Springer, Cham. https://doi.org/10.1007/978-3-031-16802-4_42
