zjunlp/WorfBenchPublic

NotificationsYou must be signed in to change notification settings
Fork6
Star106

[ICLR 2025] Benchmarking Agentic Workflow Generation

License

MIT license

106 stars 6 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
LLM		LLM
assets		assets
evaluator		evaluator
gold_traj		gold_traj
prompts		prompts
LICENSE		LICENSE
README.md		README.md
eval.sh		eval.sh
node_eval.py		node_eval.py
requirements.txt		requirements.txt

Repository files navigation

WorfBench

Benchmarking Agentic Workflow Generation

📄arXiv •🤗HFPaper •🌐Web •📊Dataset •🎧NotebookLM Audio

🌻Acknowledgement

Our code of training module is referenced and adapted fromLLaMA-Factory. And the Dataset is collected fromToolBench,ToolAlpaca,Lumos,WikiHow,Seal-Tools,Alfworld,Webshop,IntercodeSql. Our end-to-end evaluation module is based onIPR,Stable ToolBench. Thanks for their great contributions!

🌟Overview

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. You can download our dataset fromhuggingface!

🔧Installation

git clone https://github.com/zjunlp/WorFBenchcd WorFBenchpip install -r requirements.txt

✏️Model-Inference

We usellama-facotry to deploy local model with OpenAI-style API

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.gitcd LLaMA-Factorypip install -e".[torch,metrics]"API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml

📝Workflow-Generation

Generate workflow with local llm api

tasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)model_name=your_model_namefortaskin${tasks[@]};do    python node_eval.py \        --task gen_workflow \        --model_name${model_name} \        --gold_path ./gold_traj/${task}/graph_eval.json \        --pred_path ./pred_traj/${task}/${model_name}/graph_eval_two_shot.json\        --task_type${task} \        --few_shot \done

🤔Workflow-Evaluation

Evaluation the workflow in the mode ofnode orgraph

tasks=(wikihow toolbench toolalpaca lumos alfworld webshop os)model_name=your_model_namefortaskin${tasks[@]};do    python node_eval.py \        --task eval_workflow \        --model_name${model_name} \        --gold_path ./gold_traj/${task}/graph_eval.json \        --pred_path ./pred_traj/${task}/${model_name}/graph_eval_two_shot.json\        --eval_model all-mpnet-base-v2 \        --eval_output ./eval_result/${model_name}_${task}_graph_eval_two_shot.json \        --eval_type node \        --task_type${task} \done

🚩Citation

If this work is helpful, please kindly cite as:

@article{qiao2024benchmarking,title={Benchmarking Agentic Workflow Generation},author={Qiao, Shuofei and Fang, Runnan and Qiu, Zhisong and Wang, Xiaobin and Zhang, Ningyu and Jiang, Yong and Xie, Pengjun and Huang, Fei and Chen, Huajun},journal={arXiv preprint arXiv:2410.07869},year={2024}}

About

[ICLR 2025] Benchmarking Agentic Workflow Generation

zjunlp.github.io/project/WorFBench/

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

License

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

WorfBench

Benchmarking Agentic Workflow Generation

Table of Contents

🌻Acknowledgement

🌟Overview

🔧Installation

✏️Model-Inference

📝Workflow-Generation

🤔Workflow-Evaluation

🚩Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Contributors5

Languages

Movatterモバイル変換

License

zjunlp/WorfBench

Folders and files

Latest commit

History

Repository files navigation

WorfBench

Benchmarking Agentic Workflow Generation

Table of Contents

🌻Acknowledgement

🌟Overview

🔧Installation

✏️Model-Inference

📝Workflow-Generation

🤔Workflow-Evaluation

🚩Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Contributors5

Languages

Packages