mechanistic-interpretability
Here are 135 public repositories matching this topic.
Stanford NLP Python library for understanding and improving PyTorch models via interventions (Python, updated Oct 13, 2025)
This repository collects all relevant resources about interpretability in LLMs (updated Nov 1, 2024)
A curated collection of resources on the Mechanistic Interpretability (MI) of Large Multimodal Models (LMMs), aggregating surveys, blog posts, and research papers that explore how LMMs represent, transform, and align multimodal information internally (updated Oct 20, 2025)
Performant framework for training, analyzing, and visualizing Sparse Autoencoders (SAEs) and their frontier variants (Python, updated Nov 29, 2025)
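For readers new to the SAE entries above: a sparse autoencoder reconstructs a model's activations through an overcomplete ReLU bottleneck, trained with an L1 penalty so that only a few features fire per input. The sketch below shows that core computation with NumPy; all dimensions, names, and coefficients are illustrative, not taken from any listed framework.

```python
import numpy as np

# Minimal sparse autoencoder (SAE) forward pass: an overcomplete ReLU
# encoder plus a linear decoder, trained to reconstruct model activations
# under an L1 sparsity penalty. Shapes and names are illustrative.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64          # SAEs are typically overcomplete (d_sae > d_model)

W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x, l1_coeff=1e-3):
    """Encode activations x, decode them, and return the training loss terms."""
    f = np.maximum(0.0, x @ W_enc + b_enc)        # sparse feature activations
    x_hat = f @ W_dec + b_dec                     # reconstruction
    recon_loss = np.mean((x - x_hat) ** 2)        # how well x is reconstructed
    sparsity_loss = l1_coeff * np.abs(f).sum(axis=-1).mean()  # L1 on features
    return x_hat, f, recon_loss + sparsity_loss

x = rng.normal(size=(8, d_model))                 # stand-in batch of activations
x_hat, f, loss = sae_forward(x)
```

In practice the weights are fit by gradient descent on this loss over large activation datasets; the interpretability payoff is that individual columns of `W_dec` often correspond to human-describable features.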
Stanford NLP Python library for benchmarking the utility of LLM interpretability methods (Python, updated Jun 25, 2025)
Decomposing and Editing Predictions by Modeling Model Computation (Jupyter Notebook, updated Jun 12, 2024)
Steering vectors for transformer language models in PyTorch / Hugging Face (Python, updated Feb 21, 2025)
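The steering-vector entry above refers to a simple idea: compute the difference of mean hidden states between two contrastive prompt sets, then add that direction to the residual stream at inference time to steer generations. A minimal NumPy sketch, with random arrays standing in for real transformer activations (all names and scales are hypothetical):

```python
import numpy as np

# Contrastive steering-vector sketch: derive a direction from the mean
# difference of hidden states on two prompt sets, then add it (scaled)
# to hidden states at inference. Arrays stand in for real activations.
rng = np.random.default_rng(0)
d_model = 32

pos_acts = rng.normal(0.5, 1.0, (100, d_model))   # e.g. activations on "positive" prompts
neg_acts = rng.normal(-0.5, 1.0, (100, d_model))  # e.g. activations on "negative" prompts

steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vec, scale=1.0):
    """Add the steering direction to every token position's hidden state."""
    return hidden + scale * vec                    # broadcasts over seq_len

hidden = rng.normal(size=(10, d_model))            # (seq_len, d_model)
steered = apply_steering(hidden, steering_vec, scale=2.0)
```

Real implementations register a forward hook on a chosen transformer layer so the addition happens inside the model's forward pass; the arithmetic is the same.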
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms (Python, updated Feb 20, 2024)
Interpreting how transformers simulate agents performing RL tasks (Jupyter Notebook, updated Oct 23, 2023)
[ICLR 2025] Code and data repo for the paper "Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation" (Python, updated Dec 19, 2024)
Repo accompanying the paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers" (Jupyter Notebook, updated Mar 11, 2024)
🧠 Starter templates for doing interpretability research (updated Jul 16, 2023)
Full code for the sparse probing paper (Jupyter Notebook, updated Dec 17, 2023)
Sparse and discrete interpretability tool for neural networks (Python, updated Feb 12, 2024)
Unified access to Large Language Model modules using NNsight (Python, updated Nov 19, 2025)
Generating and validating natural-language explanations for the brain (Jupyter Notebook, updated Nov 11, 2025)
[ICLR 2023 spotlight] An automatic and efficient tool for describing the functionality of individual neurons in DNNs (Jupyter Notebook, updated Nov 6, 2023)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks (Python, updated Nov 30, 2024)
Mapping out the "memory" of neural nets with data attribution (Python, updated Nov 29, 2025)
Arrakis is a library for conducting, tracking, and visualizing mechanistic interpretability experiments (Jupyter Notebook, updated Apr 22, 2025)