
Can Large Language Models (LLMs) Compete with Human Requirements Reviewers? – Replication of an Inspection Experiment on Requirements Documents

  • Conference paper

Abstract

The use of large language models (LLMs) for software engineering is growing, especially for code, typically to generate code or to detect and fix quality problems. Because requirements are often written in natural language, it seems promising to exploit the capabilities of LLMs to detect problems in requirements. We replicated an inspection experiment in which computer science students searched for defects in requirements documents using different reading techniques. In our replication, we used the LLM GPT-4-Turbo instead of students to determine how the model compares to human reviewers. Additionally, we considered GPT-3.5-Turbo, Nous-Hermes-2-Mixtral-8x7B-DPO, and Phi-3-medium-128k-instruct for one research question. We focus on single-prompt approaches and avoid more complex setups in order to mimic the original study design, in which students received all the material at once. The study comprised two phases. First, we explored the general feasibility of using LLMs for requirements inspection on a practice document and examined different prompts. Second, we applied selected approaches to two requirements documents and compared the approaches to each other and to human reviewers. The approaches vary the reading technique (ad-hoc, perspective-based, checklist-based), the LLM, the instructions, and the material provided. We found that the LLMs (a) report only a limited number of deficits despite having enough tokens available, and that these deficits (b) vary little across prompts and (c) rarely match the sample solution.
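The single-prompt setup described in the abstract can be illustrated with a short sketch. The Python fragment below is a minimal, hypothetical reconstruction assuming the OpenAI chat completions client; the inspect helper, the model default, and the per-technique instructions are illustrative assumptions, not the authors' actual study material.

    # Minimal sketch of a single-prompt requirements inspection (illustrative only).
    # Assumes the OpenAI Python client (v1) and OPENAI_API_KEY in the environment;
    # the instruction texts below are placeholders, not the study's actual prompts.
    from openai import OpenAI

    client = OpenAI()

    READING_TECHNIQUES = {
        "ad-hoc": "Read the requirements document and report every defect you find.",
        "checklist-based": (
            "Check each requirement against this checklist: "
            "ambiguity, incompleteness, inconsistency, incorrectness."
        ),
        "perspective-based": (
            "Review the document from a tester's perspective: for each "
            "requirement, consider whether you could derive a test case from it."
        ),
    }

    def inspect(document: str, technique: str = "ad-hoc",
                model: str = "gpt-4-turbo") -> str:
        # Everything is sent in one prompt, mirroring the original study design
        # in which students received all the material at once.
        prompt = (
            f"{READING_TECHNIQUES[technique]}\n"
            "List each suspected defect with its location and a short rationale.\n\n"
            f"--- Requirements document ---\n{document}"
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

Under this sketch, swapping the technique argument or the model name would reproduce the kind of prompt variation the study compares.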



Acknowledgments

Parts of this work have been funded by the German Federal Ministry of Education and Research (BMBF) in the project “DeepQuali” (grant no. 01IS23016D).

Author information

Authors and Affiliations

  1. Fraunhofer Institute for Experimental Software Engineering IESE, Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany

    Daniel Seifert, Lisa Jöckel, Adam Trendowicz, Thorsten Honroth & Andreas Jedlitschka

  2. QAware GmbH, Aschauer Street 30, 81549, München, Germany

    Marcus Ciolkowski


Corresponding author

Correspondence to Daniel Seifert.

Editor information

Editors and Affiliations

  1. University of Tartu, Tartu, Estonia

    Dietmar Pfahl

  2. Blekinge Institute of Technology, Karlskrona, Sweden

    Javier Gonzalez Huerta

  3. Leibniz Universität Hannover, Hannover, Germany

    Jil Klünder

  4. University of Tartu, Tartu, Estonia

    Hina Anwar


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Seifert, D., Jöckel, L., Trendowicz, A., Ciolkowski, M., Honroth, T., Jedlitschka, A. (2025). Can Large Language Models (LLMs) Compete with Human Requirements Reviewers? – Replication of an Inspection Experiment on Requirements Documents. In: Pfahl, D., Gonzalez Huerta, J., Klünder, J., Anwar, H. (eds) Product-Focused Software Process Improvement. PROFES 2024. Lecture Notes in Computer Science, vol 15452. Springer, Cham. https://doi.org/10.1007/978-3-031-78386-9_3
