- Daniel Seifert¹,
- Lisa Jöckel¹,
- Adam Trendowicz¹,
- Marcus Ciolkowski²,
- Thorsten Honroth¹ &
- Andreas Jedlitschka¹
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15452)
Included in the following conference series: Product-Focused Software Process Improvement (PROFES)
Abstract
The use of large language models (LLMs) in software engineering is growing, especially for code, typically to generate code or to detect and fix quality problems. Because requirements are usually written in natural language, it seems promising to exploit the capabilities of LLMs to detect problems in requirements as well. We replicated an inspection experiment in which computer science students searched for defects in requirements documents using different reading techniques. In our replication, we used the LLM GPT-4-Turbo instead of students to determine how the model compares to human reviewers. Additionally, we considered GPT-3.5-Turbo, Nous-Hermes-2-Mixtral-8x7B-DPO, and Phi-3-medium-128k-instruct for one research question. We focused on single-prompt approaches and avoided more complex setups in order to mimic the original study design, in which students received all the material at once. Our study had two phases. First, we explored the general feasibility of using LLMs for requirements inspection on a practice document and examined different prompts. Second, we applied selected approaches to two requirements documents and compared them to each other and to human reviewers. The approaches vary in the reading technique (ad-hoc, perspective-based, checklist-based), the LLM, the instructions, and the material provided. We found that the LLMs (a) report only a limited number of deficits despite having enough tokens available; these deficits (b) vary little across prompts and (c) rarely match the sample solution.
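Conceptually, the single-prompt setup described in the abstract amounts to one chat-completion call per requirements document. The following Python sketch illustrates what such a call might look like for a perspective-based reading prompt using the OpenAI SDK; the model name, prompt wording, and file handling are illustrative assumptions, not the prompts or tooling actually used in the paper.

```python
# Minimal sketch (not the authors' actual prompts) of a single-prompt,
# perspective-based requirements inspection via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical perspective-based reading instruction (tester perspective).
PERSPECTIVE_PROMPT = (
    "You are a software tester reviewing the requirements document below. "
    "Read it from the tester's perspective and list every defect you find "
    "(ambiguities, omissions, inconsistencies), each with the affected "
    "requirement and a short justification."
)

def inspect(requirements_text: str, model: str = "gpt-4-turbo") -> str:
    """Send the whole document in a single prompt, mirroring the study's
    one-shot setup where reviewers received all material at once."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSPECTIVE_PROMPT},
            {"role": "user", "content": requirements_text},
        ],
        temperature=0,  # favor reproducible defect lists across runs
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # "requirements.txt" is a placeholder for one of the study's documents.
    with open("requirements.txt", encoding="utf-8") as f:
        print(inspect(f.read()))
```

Swapping the system prompt for an ad-hoc or checklist-based instruction, or the model name for one of the other LLMs listed above, would yield the other prompt variants compared in the study.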
Acknowledgments
Parts of this work have been funded by the German Federal Ministry of Education and Research (BMBF) in the project “DeepQuali” (grant no. 01IS23016D).
Author information
Authors and Affiliations
Fraunhofer Institute for Experimental Software Engineering IESE, Fraunhofer-Platz 1, 67663 Kaiserslautern, Germany
Daniel Seifert, Lisa Jöckel, Adam Trendowicz, Thorsten Honroth & Andreas Jedlitschka
QAware GmbH, Aschauer Str. 30, 81549 München, Germany
Marcus Ciolkowski
Corresponding author
Correspondence to Daniel Seifert.
Editor information
Editors and Affiliations
University of Tartu, Tartu, Estonia
Dietmar Pfahl
Blekinge Institute of Technology, Karlskrona, Sweden
Javier Gonzalez Huerta
Leibniz Universität Hannover, Hannover, Germany
Jil Klünder
University of Tartu, Tartu, Estonia
Hina Anwar
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Seifert, D., Jöckel, L., Trendowicz, A., Ciolkowski, M., Honroth, T., Jedlitschka, A. (2025). Can Large Language Models (LLMs) Compete with Human Requirements Reviewers? – Replication of an Inspection Experiment on Requirements Documents. In: Pfahl, D., Gonzalez Huerta, J., Klünder, J., Anwar, H. (eds) Product-Focused Software Process Improvement. PROFES 2024. Lecture Notes in Computer Science, vol 15452. Springer, Cham. https://doi.org/10.1007/978-3-031-78386-9_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78385-2
Online ISBN: 978-3-031-78386-9
eBook Packages: Computer Science, Computer Science (R0)