# Implementation of the paper "Evaluating the Reliability of Self-Explanations in Large Language Models"
**Abstract:** This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations - extractive and counterfactual - using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged, as prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g., SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
After cloning or downloading this repository, first run the Linux shell script `./setup.sh`. It will initialize the workspace by performing the following steps (a combined command sketch follows the list):
- It will install the required Python modules by running `pip install -r "./requirements.txt"`.
- It will download the necessary Python code to compute the BARTScore by Yuan et al. (2021) to "./resources/bart_score.py".
- It will download and preprocess the Food Incidents Dataset by Randl et al. (2024) to "./data/food incidents - hazard/".
- It will download and preprocess the "Movies" task (Zaidan and Eisner, 2008) of the ERASER benchmark by DeYoung et al. (2020) to "./data/movies/".
When preprocessing is finished, the experiments can be rerun using the shell script `./run.sh`, which will run each of the following Python files in turn:
```bash
python ./gemma-2b-hazard.py
python ./gemma-2b-movies.py
python ./gemma-7b-hazard.py
python ./gemma-7b-movies.py
python ./llama-8b-hazard.py
python ./llama-8b-movies.py
```
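If only one model/task combination is of interest, the corresponding script can also be invoked on its own instead of running the full `./run.sh` sweep; for example:

```bash
# Rerun only the 8B Llama experiment on the food-hazard classification task.
python ./llama-8b-hazard.py
```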
Originally, the experiments were performed using Python 3.10.12 on 8 NVIDIA RTX A5500 graphics cards with 24 GB of memory each.

Finally, the Jupyter notebooks `evaluate-hazard.ipynb` and `evaluate-movies.ipynb` can be used to analyze the results.
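For example, assuming Jupyter is installed in the same environment (it is not listed among the setup steps above), either notebook can be opened with:

```bash
# Launch Jupyter in the repository root and open the hazard evaluation notebook
# in the browser; evaluate-movies.ipynb works analogously.
jupyter notebook evaluate-hazard.ipynb
```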