This is a tool for evaluating fill-in-the-middle (FIM) code generation: it generates and evaluates completions on datasets in four languages: Java, Python, C++, and JavaScript.
Introduction to datasets and evaluation metrics
For each of the four languages (Java, Python, C++, and JavaScript), the task provides the code above and below a masked block (the prefix and suffix), and the model must predict the missing middle. Evaluation metrics include:
- Exact Match
- BLEU-4
- CodeBLEU
- Length(Pred/Ref)
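As an illustration, the sketch below shows how Exact Match, BLEU-4, and the Length(Pred/Ref) ratio can be computed for a single prediction. This is a minimal sketch using nltk for BLEU; the repository's own scoring code may tokenize and smooth differently, and CodeBLEU additionally weighs syntax (AST) and data-flow matches, which are not reproduced here.

```python
# Minimal per-sample metric sketch (illustrative; not the repo's exact scoring code).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def score_sample(prediction: str, reference: str) -> dict:
    # Exact Match: 1 if the generated middle equals the reference after trimming whitespace.
    exact_match = int(prediction.strip() == reference.strip())

    # BLEU-4: 4-gram BLEU over whitespace tokens, smoothed for short code snippets.
    pred_toks, ref_toks = prediction.split(), reference.split()
    bleu4 = sentence_bleu(
        [ref_toks], pred_toks,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )

    # Length(Pred/Ref): ratio of predicted length to reference length.
    length_ratio = len(pred_toks) / max(len(ref_toks), 1)

    return {"exact_match": exact_match, "bleu4": bleu4, "length_pred_ref": length_ratio}

print(score_sample("return a + b", "return a + b"))
```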
Environment Requirements
To run the model inference and evaluation code, you'll need the following environment setup:
Python 3.8 or higher
PyTorch 2.1.0 or higher
sentencepiece 0.2.0 or higher
transformers 4.34.1 or higher (if running inference with the transformers library; see the sketch after this list)
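For reference, here is a minimal sketch of FIM generation through the transformers library. It uses StarCoder2's documented <fim_prefix>/<fim_suffix>/<fim_middle> sentinel tokens; the other supported models use different FIM tokens, so consult each model card before adapting this.

```python
# Minimal FIM sketch with transformers (illustrative; FIM token names vary by model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
# StarCoder2 uses prefix-suffix-middle (PSM) formatting for infilling.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
# Decode only the newly generated middle, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```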
Please ensure all dependencies are installed using the following command:
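```
pip install -r requirements.txt
```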
requirements.txt lists all necessary libraries and their versions.
To achieve faster inference speeds, especially for large models, we recommend installing flash attention. Flash attention is an optimized attention mechanism that significantly reduces computation time for transformer-based models without sacrificing accuracy.
Before proceeding, ensure your environment meets the CUDA requirements, as flash attention leverages GPU acceleration. Follow these steps to install flash attention:
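The command below follows the flash-attn project's standard installation instructions; build requirements depend on your CUDA and PyTorch versions, so check the flash-attn README for compatibility. Once installed, it can be enabled through the --attn_implementation flag described below.

```
pip install ninja
pip install flash-attn --no-build-isolation
```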
Inference
Here's an example of a generation task:
python run_inference.py --model aiXcoder/aixcoder-7b-base --language java
--model Model name on Hugging Face.
Currently, FIM generation is supported for four models on Hugging Face:
- deepseek-ai/deepseek-coder-6.7b-base
- aiXcoder/aixcoder-7b-base
- codellama/CodeLlama-7b-hf
- bigcode/starcoder2-7b
You can also pass the path to model weights that have already been downloaded locally.
--language Dataset language.
Four languages are supported: Python, Java, C++, and JavaScript. You can specify a single language or several at once, separated by spaces.
--output_dir The output path for generated results; by default they are saved in the output_dir folder in the current directory.
--device Sets the CUDA device to use; default cuda.
--torch_dtype Sets the precision; default bf16. Can be set to: "fp32", "fp16", "bf16".
--attn_implementation Whether to use FlashAttention; default True. If your environment does not support FlashAttention, set this to False.
--gen_len Sets the maximum generation length (max_new_tokens); default 512.
--max_len Sets the maximum total sequence length (prompt plus generated tokens); default 16384.
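Putting several options together, an invocation might look like this (illustrative values; all flags are described above):

```
python run_inference.py --model deepseek-ai/deepseek-coder-6.7b-base --language java python --output_dir output_dir --torch_dtype bf16 --gen_len 512
```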
Evaluation
Here's an example of an evaluation task:
python run_evaluate.py
--language The language to be evaluated.
Four languages are supported: Python, Java, C++, and JavaScript. You can specify a single language or several at once, separated by spaces.
--result_path The output path for evaluation results; by default they are stored in the output_dir folder in the current directory. Two files are generated, with the suffixes _scored.jsonl and _statistics.txt. The _statistics.txt file records the scores for each Task Type as well as the averaged overall results.
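For example, to evaluate the generated results for Java and Python in one run (illustrative path; flags as described above):

```
python run_evaluate.py --language java python --result_path output_dir
```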