# Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples?
In this work, we evaluate methods that use output probabilities, internal causal-agnostic features, and internal causal features to predict the correctness of LLM outputs. We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.
*Figure: The hypothesized correspondence between internal mechanisms and generalization behaviors. In this work, we focus on the prediction direction.*
We release a dataset of five correctness prediction tasks. Given a task input and an LLM, the goal is to predict whether the LLM output is correct.
The five tasks cover symbol manipulation, knowledge retrieval, and instruction following, as shown below.
| Task Type | Internal Mechanisms Known? | Task Names |
|---|---|---|
| Symbol manipulation | Fully known | Indirect Object Identification (IOI); PriceTag |
| Knowledge retrieval | Partially known | RAVEL; MMLU |
| Instruction following | Partially known | Unlearn Harry Potter |
Each JSON file represents one fold of a task, structured as follows:
{"train" : {"correct": ["prompt_0","prompt_1", ... ],"wrong": ["prompt_0","prompt_1", ... ] },"val": { ... },"test": { ... }}We release the prompts used in our experiment, where the "correct" and "wrong" labels are determined usingLlama-3-8B-Instruct as the target model.
If you are using these tasks to predict behaviors of a different target model, you need to regenerate the correctness label of these prompts.
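For reference, a fold file that follows the structure above can be loaded with standard JSON tooling. The snippet below is a minimal sketch; the file name `mmlu_fold_0.json` is a placeholder, not an actual file in this repo.

```python
import json

# Placeholder file name; substitute a real fold file from the dataset.
with open("mmlu_fold_0.json") as f:
    fold = json.load(f)

# Each split maps "correct"/"wrong" to lists of prompt strings.
for split in ("train", "val", "test"):
    n_correct = len(fold[split]["correct"])
    n_wrong = len(fold[split]["wrong"])
    print(f"{split}: {n_correct} correct / {n_wrong} wrong prompts")
```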
We evaluate four correctness prediction methods, categorized by the type of features they use.
| Method | Feature Type | Requires Training | Requires Wrong Samples | Requires Counterfactuals | Requires Decoding |
|---|---|---|---|---|---|
| Confidence Score | Output probabilities | ✗ | ✗ | ✗ | ✓ |
| Correctness Probing | Internal causal-agnostic features | ✓ | ✓ | ✗ | Maybe |
| Counterfactual Simulation | Internal causal features | Localization only | ✗ | ✓ | ✓ |
| Value Probing | Internal causal features | ✓ | ✗ | Localization only | Maybe |
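As a concrete reference point, the confidence-score baseline can be approximated by thresholding the probability the target model assigns to its own greedy output. The sketch below is illustrative only, not this repo's implementation; the single-token answer assumption and the placeholder threshold are ours.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model used for the released labels; any causal LM works for this sketch.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def confidence_score(prompt: str) -> float:
    """Probability the model assigns to its greedy next token.

    Treating the answer as a single token (e.g., an MMLU answer letter) is a
    simplification; multi-token answers would need a product or mean over
    token probabilities.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]      # next-token logits
    probs = torch.softmax(logits.float(), dim=-1)
    return probs.max().item()

# Predict "correct" when confidence exceeds a threshold tuned on the val split.
# The 0.5 here is a placeholder, not a value from the paper.
is_predicted_correct = confidence_score("Question: ... Answer:") > 0.5
```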
We provide a demo evaluating each method on the MMLU correctness prediction task.
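For orientation, the correctness-probing baseline trains a classifier on internal causal-agnostic features. The sketch below (reusing `model`, `tokenizer`, and `fold` from the snippets above) is our own minimal illustration, not the demo itself; the layer index, last-token pooling, and logistic-regression classifier are assumptions.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def last_token_hidden_state(prompt: str, layer: int = 16) -> np.ndarray:
    """Hidden state of the final prompt token at one layer (causal-agnostic feature)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

def featurize(split: dict) -> tuple[np.ndarray, np.ndarray]:
    """Stack features for correct (label 1) and wrong (label 0) prompts."""
    prompts = split["correct"] + split["wrong"]
    X = np.stack([last_token_hidden_state(p) for p in prompts])
    y = np.array([1] * len(split["correct"]) + [0] * len(split["wrong"]))
    return X, y

X_train, y_train = featurize(fold["train"])
X_test, y_test = featurize(fold["test"])

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```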
If you use the content of this repo, please consider citing the following work:
```bibtex
@inproceedings{huang2025internal,
  title     = {Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors},
  author    = {Jing Huang and Junyi Tao and Thomas Icard and Diyi Yang and Christopher Potts},
  booktitle = {Forty-second International Conference on Machine Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=Ofa1cspTrv}
}
```