# Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors
Interpretability research now offers a variety of techniques for identifying abstract internal mechanisms in neural networks. Can such techniques be used to predict how models will behave on out-of-distribution examples?
In this work, we evaluate methods that use output probabilities, internal causal-agnostic features, and internal causal features to predict the correctness of LLM outputs. We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior.
*Figure: The hypothesized correspondence between internal mechanisms and generalization behaviors. In this work, we focus on the prediction direction.*
We release a dataset of five correctness prediction tasks. Given a task input and an LLM, the goal is to predict whether the LLM output is correct.
The five tasks cover symbol manipulation, knowledge retrieval, and instruction following, as shown below.
| Task Type | Internal Mechanisms Known? | Task Names |
|---|---|---|
| Symbol manipulation | Fully known | Indirect Object Identification (IOI); PriceTag |
| Knowledge retrieval | Partially known | RAVEL; MMLU |
| Instruction following | Partially known | Unlearn Harry Potter |
Each JSON file represents one fold of a task, structured as follows:
{"train" : {"correct": ["prompt_0","prompt_1", ... ],"wrong": ["prompt_0","prompt_1", ... ] },"val": { ... },"test": { ... }}We release the prompts used in our experiment, where the "correct" and "wrong" labels are determined usingLlama-3-8B-Instruct as the target model.
If you are using these tasks to predict behaviors of a different target model, you need to regenerate the correctness label of these prompts.
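For reference, a fold file that follows the structure above can be loaded with standard JSON tooling. The snippet below is a minimal sketch; the file name `mmlu_fold_0.json` is a placeholder, not an actual file in this repo.

```python
import json

# Placeholder file name; substitute a real fold file from the dataset.
with open("mmlu_fold_0.json") as f:
    fold = json.load(f)

# Each split maps "correct"/"wrong" to lists of prompt strings.
for split in ("train", "val", "test"):
    n_correct = len(fold[split]["correct"])
    n_wrong = len(fold[split]["wrong"])
    print(f"{split}: {n_correct} correct / {n_wrong} wrong prompts")
```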
We evaluate four correctness prediction methods, categorized by the type of features they use.
| Method | Feature Type | Requires Training | Requires Wrong Samples | Requires Counterfactuals | Requires Decoding |
|---|---|---|---|---|---|
| Confidence Score | Output probabilities | ✗ | ✗ | ✗ | ✓ |
| Correctness Probing | Internal causal-agnostic features | ✓ | ✓ | ✗ | Maybe |
| Counterfactual Simulation | Internal causal features | Localization only | ✗ | ✓ | ✓ |
| Value Probing | Internal causal features | ✓ | ✗ | Localization only | Maybe |
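As a concrete reference point, the confidence-score baseline can be approximated by thresholding the probability the target model assigns to its own greedy output. The sketch below is illustrative only, not this repo's implementation; the single-token answer assumption and the placeholder threshold are ours.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model used for the released labels; any causal LM works for this sketch.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def confidence_score(prompt: str) -> float:
    """Probability the model assigns to its greedy next token.

    Treating the answer as a single token (e.g., an MMLU answer letter) is a
    simplification; multi-token answers would need a product or mean over
    token probabilities.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]      # next-token logits
    probs = torch.softmax(logits.float(), dim=-1)
    return probs.max().item()

# Predict "correct" when confidence exceeds a threshold tuned on the val split.
# The 0.5 here is a placeholder, not a value from the paper.
is_predicted_correct = confidence_score("Question: ... Answer:") > 0.5
```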
We provide a demo evaluating each method on the MMLU correctness prediction task.
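For orientation, the correctness-probing baseline trains a classifier on internal causal-agnostic features. The sketch below (reusing `model`, `tokenizer`, and `fold` from the snippets above) is our own minimal illustration, not the demo itself; the layer index, last-token pooling, and logistic-regression classifier are assumptions.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def last_token_hidden_state(prompt: str, layer: int = 16) -> np.ndarray:
    """Hidden state of the final prompt token at one layer (causal-agnostic feature)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

def featurize(split: dict) -> tuple[np.ndarray, np.ndarray]:
    """Stack features for correct (label 1) and wrong (label 0) prompts."""
    prompts = split["correct"] + split["wrong"]
    X = np.stack([last_token_hidden_state(p) for p in prompts])
    y = np.array([1] * len(split["correct"]) + [0] * len(split["wrong"]))
    return X, y

X_train, y_train = featurize(fold["train"])
X_test, y_test = featurize(fold["test"])

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```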
If you use the content of this repo, please consider citing the following work:
```bibtex
@inproceedings{huang2025internal,
  title     = {Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors},
  author    = {Jing Huang and Junyi Tao and Thomas Icard and Diyi Yang and Christopher Potts},
  booktitle = {Forty-second International Conference on Machine Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=Ofa1cspTrv}
}
```