LongBench v2 and LongBench (ACL 2024)
🌐 Project Page • 📚 LongBench v2 Paper • 📊 LongBench v2 Dataset • 𝕏 Thread
📖 LongBench Paper • 🤗 LongBench Dataset
📢 The original LongBench v1 files have been moved under LongBench/; read its README here.
LongBench v2 is designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 has the following features: (1) Length: Context length ranging from 8k to 2M words, with the majority under 128k. (2) Difficulty: Challenging enough that even human experts, using search tools within the document, cannot answer correctly in a short time. (3) Coverage: Covers various realistic scenarios. (4) Reliability: All questions are in a multiple-choice format for reliable evaluation.
To elaborate, LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repo understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when answering the questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.
🔍 With LongBench v2, we are eager to find out how scaling inference-time compute will affect deep understanding and reasoning in long-context scenarios. View our 🏆 leaderboard here (continuously updated).
🔥🔥🔥 [2025/01/15] More evaluation results have been added to our leaderboard, including Gemini-Exp-1206, Gemini-2.0-Flash, DeepSeek-V3, and MiniMax-Text-01. Check them out!
🔥🔥🔥 [2024/12/20] We are excited to release LongBench v2! Compared to the first generation of LongBench, LongBench v2 is much longer and much more challenging. Its goal is to provide a reliable evaluation standard for the development of future superhuman long-context AI systems.
You can download and load the LongBench v2 data through the Hugging Face datasets library (🤗 HF Repo):
from datasets import load_dataset

dataset = load_dataset('THUDM/LongBench-v2', split='train')
Alternatively, you can download the file from this link to load the data.
All data in LongBench v2 are standardized to the following format:
{"_id":"Unique identifier for each piece of data","domain":"The primary domain category of the data","sub_domain":"The specific sub-domain category within the domain","difficulty":"The difficulty level of the task, either 'easy' or 'hard'","length":"The length category of the task, which can be 'short', 'medium', or 'long'","question":"The input/command for the task, usually short, such as questions in QA, queries in many-shot learning, etc","choice_A":"Option A","choice_B":"Option B","choice_C":"Option C","choice_D":"Option D","answer":"The groundtruth answer, denoted as A, B, C, or D","context":"The long context required for the task, such as documents, books, code repositories, etc."}
Install the requirements with pip: pip install -r requirements.txt.
To run model evaluation, first add your model path and its context window length to config/ (a hypothetical entry is sketched below), then follow these steps (we take GLM-4-9B-Chat as a running example):
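The exact file names and field names under config/ are defined by the repo, so mirror the existing entries there. Purely as a hypothetical sketch, a new entry typically pairs the model name passed to pred.py with its serving path and context window length, for example:

{
    "GLM-4-9B-Chat": {
        "path": "THUDM/glm-4-9b-chat",
        "max_len": 131072
    }
}

(The keys "path" and "max_len" are illustrative placeholders; use whatever schema the existing config files follow.)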
First, deploy your model using vLLM. Run the following command to serve the model:
vllm serve THUDM/glm-4-9b-chat --api-key token-abc123 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len 131072 --trust-remote-code
- --tensor-parallel-size 4 specifies the number of tensor-parallel slices. Set it to a higher value, e.g., 8, to serve larger models such as Llama-3.1-70B-Instruct or Qwen2.5-72B-Instruct.
- Adjust --gpu-memory-utilization to control GPU memory usage.
- Set --max_model_len to the context window length of the model.
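Once the server is up, you can sanity-check the deployment before running the benchmark. vLLM exposes an OpenAI-compatible API (by default at http://localhost:8000/v1); a minimal check with the openai client, assuming that default address and the API key from the command above, looks like:

from openai import OpenAI

# Assumes vLLM's default host/port; adjust base_url if you serve elsewhere.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

resp = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)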
Once your model is deployed, modify the URL and API_KEY in pred.py to match your serving instance. Run the model inference with the following command:
python pred.py --model GLM-4-9B-Chat
- --cot: Enable evaluation under the Chain-of-Thought (CoT) setting.
- --no_context: Test the model's performance without the long context (pure memorization).
- --rag N: Use the top-N retrieved contexts during +RAG evaluation. Set to 0 by default, which disables RAG. For details on the retrieval process, refer to the retrieve.py file.
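For example, the different settings described above can be run as separate invocations of pred.py for the same model:

python pred.py --model GLM-4-9B-Chat              # direct answering (default)
python pred.py --model GLM-4-9B-Chat --cot        # Chain-of-Thought
python pred.py --model GLM-4-9B-Chat --no_context # no long context (memorization)
python pred.py --model GLM-4-9B-Chat --rag 5      # +RAG with top-5 retrieved contexts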
Finally, run python result.py to export the evaluation results.
@article{bai2024longbench2,
  title   = {LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks},
  author  = {Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
  journal = {arXiv preprint arXiv:2412.15204},
  year    = {2024}
}

@inproceedings{bai2024longbench,
  title     = "{L}ong{B}ench: A Bilingual, Multitask Benchmark for Long Context Understanding",
  author    = "Bai, Yushi and Lv, Xin and Zhang, Jiajie and Lyu, Hongchang and Tang, Jiankai and Huang, Zhidian and Du, Zhengxiao and Liu, Xiao and Zeng, Aohan and Hou, Lei and Dong, Yuxiao and Tang, Jie and Li, Juanzi",
  booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = aug,
  year      = "2024",
  address   = "Bangkok, Thailand",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2024.acl-long.172",
  doi       = "10.18653/v1/2024.acl-long.172",
  pages     = "3119--3137",
}