🎉🎉 PLEASE SEE THE UPDATED VERSION HERE 🎉🎉: https://github.com/ataylor24/MAGMA_v2
# Are Large-Language Models Graph Algorithmic Reasoners? [pdf]
## Overview

The MAGMA Benchmark is designed to evaluate the performance of large language models (LLMs) on classical graph algorithms using intermediate steps. Despite advances in LLMs, they exhibit significant limitations in structured, multi-step reasoning tasks, particularly those involving explicit graph structures. Our benchmark addresses this gap by evaluating state-of-the-art LLMs on five fundamental algorithms: BFS, DFS, Dijkstra's, Floyd-Warshall, and Prim's MST.

We are actively updating this benchmark! Please reach out to the contact email below with any update requests or bug fixes.
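To make the intermediate-step setting concrete, the sketch below runs BFS on a toy graph and records the visited set after every expansion step. This is purely illustrative: the exact trace and prompt format used by the benchmark is defined by the data generation code (see `data_generation`) and may differ from this simplified view.

```python
from collections import deque

def bfs_with_trace(adj, source):
    """Run BFS and record the visited set after each expansion step.

    Illustrative only: the intermediate-step format used by MAGMA
    (see data_generation) may differ from this simplified trace.
    """
    visited = {source}
    queue = deque([source])
    trace = [sorted(visited)]          # state after initialization
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
        trace.append(sorted(visited))  # state after expanding `node`
    return trace

# Example: a small undirected graph given as an adjacency list.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
for step, state in enumerate(bfs_with_trace(adj, source=0)):
    print(f"step {step}: visited = {state}")
```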
## Key Features

- Comprehensive Benchmark: Evaluates LLM performance on classical graph algorithms.
- Intermediate Steps Evaluation: Focuses on the accuracy of intermediate reasoning steps, not just the final answer.
- Multiple Algorithms: Includes BFS, DFS, Dijkstra's, Floyd-Warshall, and Prim's MST.
- Advanced Prompting Techniques: Explores prompting strategies (e.g., chain-of-thought) and algorithmic instructions.
## Installation

Prerequisites:

- Python 3.10 or higher
- Conda

Clone the repository:

```bash
git clone https://github.com/yourusername/LLM-CLRS-Graph-Reasoning-Benchmark.git
cd LLM-CLRS-Graph-Reasoning-Benchmark
```

To create a Conda environment with the required dependencies, run the following command:

```bash
conda env create --file environment.yml
```

This will create a new Conda environment with all the dependencies specified in the `environment.yml` file.

Activate the newly created environment using:

```bash
conda activate nar2
```
## Usage

### Running the Benchmark

An example script for running the benchmark on the included algorithms is provided in `run_scripts`:

```bash
bash run_scripts/bfs_CoT.sh
```

### Inference

An example script for running the benchmark on a selected algorithm is provided in `inference_scripts`:

```bash
bash inference_scripts/bfs_CoT.sh
```
### Configuration

You can customize the model training settings using the configuration file `configuration_example/config_qlora.yaml`.
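If it is more convenient to adjust settings programmatically, a minimal sketch along the following lines loads and modifies the YAML file before training. The keys shown (`learning_rate`, `lora_r`) are illustrative assumptions; consult `config_qlora.yaml` for the actual field names.

```python
import yaml  # PyYAML

# Load the example QLoRA training configuration.
with open("configuration_example/config_qlora.yaml") as f:
    config = yaml.safe_load(f)

# Adjust a couple of settings before launching training.
# NOTE: these key names are illustrative, not taken from the repo;
# check config_qlora.yaml for the real field names.
config["learning_rate"] = 2e-4
config["lora_r"] = 8

# Save a customized copy alongside the example file.
with open("configuration_example/config_qlora_custom.yaml", "w") as f:
    yaml.safe_dump(config, f)
```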
## Evaluation Metrics

The benchmark uses the following metrics:

- Exact Match Accuracy: Measures the correctness of the final output. (Primary metric used in the paper.)
- F1 Score: Measures the partial correctness of the final output.

Note: We also include "partial" accuracies that give credit for outputs that are similar to the desired output but lack the full response template.

For both the Exact Match Accuracy and F1 Score metrics, we provide the following variants (a rough computation sketch follows this list):

- Intermediate Steps Accuracy: Evaluates model performance on the intermediate steps.
- Final Step Accuracy: Evaluates model performance on only the final step.
- Trajectory Accuracy: Evaluates model performance on the full trajectory (i.e., intermediate steps and the final step).
- Independent Accuracy: Evaluates model performance on each inference independently (trajectory agnostic).
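As a rough illustration of how these variants relate, the sketch below computes them with step-level exact match over one predicted and one gold trajectory. This is a simplified, assumption-based example; the repository's evaluation code handles details such as output normalization, the response template, and F1 computation that are omitted here.

```python
def exact_match(pred, gold):
    """1.0 if the predicted step matches the gold step exactly, else 0.0."""
    return float(pred.strip() == gold.strip())

def trajectory_metrics(pred_steps, gold_steps):
    """Illustrative computation of the step-level variants described above.

    `pred_steps`/`gold_steps` are lists of step outputs for one problem
    instance; the benchmark's own evaluation may differ in details.
    """
    per_step = [exact_match(p, g) for p, g in zip(pred_steps, gold_steps)]
    return {
        # Accuracy over intermediate steps only (all but the last step).
        "intermediate_steps": sum(per_step[:-1]) / max(len(per_step) - 1, 1),
        # Accuracy on only the final step.
        "final_step": per_step[-1],
        # Full-trajectory accuracy: every step must be correct.
        "trajectory": float(all(per_step)),
        # Independent accuracy: mean over all steps, trajectory agnostic.
        "independent": sum(per_step) / len(per_step),
    }

print(trajectory_metrics(
    pred_steps=["{0}", "{0,1,2}", "{0,1,2,3}"],
    gold_steps=["{0}", "{0,1,2}", "{0,1,2,3}"],
))
```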
## Contributing

We welcome contributions to improve this benchmark. Please follow these steps:

- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to the branch (`git push origin feature-branch`).
- Open a Pull Request.
## Data Generation and Hyperparameters

See `data_generation` for further details.

- Seed used: 100898
- BFS, Llama3: r and α set to 8

Otherwise, baseline data generation and model training follow default parameter settings.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments

- Data used in the benchmark is translated from the CLRS benchmark, which can be found here: https://github.com/google-deepmind/clrs
- Model training was adapted from the Hugging Face Alignment Handbook: https://github.com/huggingface/alignment-handbook.git
## Contact

For questions or feedback, please open an issue or contact us at ataylor2@cs.ucla.edu.

Thank you for using the LLM-CLRS Graph Reasoning Benchmark! We hope this benchmark helps advance the understanding and capabilities of large language models in structured reasoning tasks.