# Implementation of the paper "Evaluating the Reliability of Self-Explanations in Large Language Models"
**Abstract:** This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged, as prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
After cloning or downloading this repository, first run the Linux shell script `./setup.sh`. It will initialize the workspace by performing the following steps (a combined command sketch follows the list):
- It will install the required Python modules by running `pip install -r "./requirements.txt"`.
- It will download the necessary Python code to compute the BARTScore by Yuan et al. (2021) to "./resources/bart_score.py".
- It will download and preprocess the Food Incidents Dataset by Randl et al. (2024) to "./data/food incidents - hazard/".
- It will download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark by DeYoung et al. (2020) to "./data/movies/".
When preprocessing is finished, the experiments can be rerun using the shell script `./run.sh`, which will run each of the following Python files in turn:
```bash
python ./gemma-2b-hazard.py
python ./gemma-2b-movies.py
python ./gemma-7b-hazard.py
python ./gemma-7b-movies.py
python ./llama-8b-hazard.py
python ./llama-8b-movies.py
```
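If only one model/task combination is of interest, the corresponding script can also be invoked on its own instead of running the full `./run.sh` sweep; for example:

```bash
# Rerun only the 8B Llama experiment on the food-hazard classification task.
python ./llama-8b-hazard.py
```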
Originally, the experiments were performed using Python 3.10.12 on 8 NVIDIA RTX A5500 graphics cards with 24 GB of memory each.

Finally, the Jupyter notebooks `evaluate-hazard.ipynb` and `evaluate-movies.ipynb` can be used to analyze the results.
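For example, assuming Jupyter is installed in the same environment (it is not listed among the setup steps above), either notebook can be opened with:

```bash
# Launch Jupyter in the repository root and open the hazard evaluation notebook
# in the browser; evaluate-movies.ipynb works analogously.
jupyter notebook evaluate-hazard.ipynb
```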