nemo_rl.evals.eval#
Module Contents#
Classes#
Functions#
| `setup` | Set up components for model evaluation. |
| `eval_pass_k` | Evaluate pass@k score using an unbiased estimator. |
| `eval_cons_k` | Evaluate cons@k score using an unbiased estimator. |
| `run_env_eval` | Main entry point for running evaluation using environment. |
| `_run_env_eval_impl` | Unified implementation for both sync and async evaluation. |
| `_generate_texts` | Generate texts using either sync or async method. |
| `_save_evaluation_data_to_json` | Save evaluation data to a JSON file. |
| `_print_results` | Print evaluation results. |
API#
- class nemo_rl.evals.eval.EvalConfig#
Bases: typing.TypedDict
- metric: str#
None
- num_tests_per_prompt: int#
None
- seed: int#
None
- k_value: int#
None
- save_path: str | None#
None
- class nemo_rl.evals.eval._PassThroughMathConfig#
Bases: typing.TypedDict
- class nemo_rl.evals.eval.MasterConfig#
Bases: typing.TypedDict
- eval: nemo_rl.evals.eval.EvalConfig#
None
- generation: nemo_rl.models.generation.interfaces.GenerationConfig#
None
- tokenizer: nemo_rl.models.policy.TokenizerConfig#
None
- data: nemo_rl.data.EvalDataConfigType#
None
- nemo_rl.evals.eval.setup(
- master_config: nemo_rl.evals.eval.MasterConfig,
- tokenizer: transformers.AutoTokenizer,
- dataset: nemo_rl.data.datasets.AllTaskProcessedDataset,
)#
Set up components for model evaluation.
Initializes the VLLM model and data loader.
- Parameters:
master_config – Configuration settings.
dataset – Dataset to evaluate on.
- Returns:
VLLM model, data loader, and config.
- nemo_rl.evals.eval.eval_pass_k(
- rewards: torch.Tensor,
- num_tests_per_prompt: int,
- k: int,
)#
Evaluate pass@k score using an unbiased estimator.
Reference: https://github.com/huggingface/evaluate/blob/32546aafec25cdc2a5d7dd9f941fc5be56ba122f/metrics/code_eval/code_eval.py#L198-L213
- Parameters:
rewards – Tensor of shape (batch_size * num_tests_per_prompt)
num_tests_per_prompt – int (number of generations per prompt)
k – int (pass@k value)
- Returns:
pass_k_score
- Return type:
float
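The unbiased estimator referenced above (the HuggingFace `code_eval` formula) computes pass@k per prompt as 1 - C(n-c, k)/C(n, k), where n is the number of generations and c the number that passed, then averages over prompts. A stdlib sketch under the assumption that a reward of 1 marks a correct generation (the function names here are illustrative, not the module's API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one prompt: probability that at least one of
    k generations drawn without replacement from n is correct, given c
    of the n are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def eval_pass_k_sketch(rewards, num_tests_per_prompt, k):
    """Average pass@k over prompts; rewards is a flat list of
    batch_size * num_tests_per_prompt 0/1 scores, grouped per prompt."""
    groups = [
        rewards[i : i + num_tests_per_prompt]
        for i in range(0, len(rewards), num_tests_per_prompt)
    ]
    return sum(pass_at_k(len(g), int(sum(g)), k) for g in groups) / len(groups)
```

The combinatorial form is preferred over naively taking the best of the first k samples because it uses all n generations, which lowers the variance of the estimate.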
- nemo_rl.evals.eval.eval_cons_k(
- rewards: torch.Tensor,
- num_tests_per_prompt: int,
- k: int,
- extracted_answers: list[str | None],
)#
Evaluate cons@k score using an unbiased estimator.
- Parameters:
rewards – Tensor of shape (batch_size * num_tests_per_prompt)
num_tests_per_prompt – int
k – int
extracted_answers – list[str | None]
- Returns:
cons_k_score
- Return type:
float
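The source does not spell out the unbiased cons@k formula, so the sketch below only illustrates the plain majority-vote consensus that cons@k generalizes, for a single prompt: pick the most common extracted answer among k generations and score it by the reward of a generation that produced it. The function name is hypothetical:

```python
from collections import Counter

def cons_at_k_sketch(extracted_answers, rewards, k):
    """Illustrative majority vote over the first k generations of one
    prompt (not the library's unbiased estimator)."""
    pairs = [
        (a, r) for a, r in zip(extracted_answers[:k], rewards[:k])
        if a is not None  # skip generations with no parsable answer
    ]
    if not pairs:
        return 0.0
    majority, _ = Counter(a for a, _ in pairs).most_common(1)[0]
    # reward of any generation whose extracted answer matches the majority
    return next(r for a, r in pairs if a == majority)
```

This is why `extracted_answers` is needed in addition to `rewards`: consensus is taken over the answer strings, not over the scores.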
- nemo_rl.evals.eval.run_env_eval(vllm_generation, dataloader, env, master_config)#
Main entry point for running evaluation using environment.
Generates model responses and evaluates them by env.
- Parameters:
vllm_generation – Model for generating responses.
dataloader – Data loader with evaluation samples.
env – Environment that scores responses.
master_config – Configuration settings.
- async nemo_rl.evals.eval._run_env_eval_impl(
- vllm_generation,
- dataloader,
- env,
- master_config,
- use_async=False,
)#
Unified implementation for both sync and async evaluation.
- async nemo_rl.evals.eval._generate_texts(vllm_generation, inputs, use_async)#
Generate texts using either sync or async method.
- nemo_rl.evals.eval._save_evaluation_data_to_json(
- evaluation_data,
- master_config,
- save_path,
)#
Save evaluation data to a JSON file.
- Parameters:
evaluation_data – List of evaluation samples
master_config – Configuration dictionary
save_path – Path to save evaluation results. Set to null to disable saving. Example: "results/eval_output" or "/path/to/evaluation_results"
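The behavior documented above (write a JSON file, treat a null path as "don't save") can be sketched with the stdlib; the helper name and the `.json` suffix handling are assumptions, not the module's actual implementation:

```python
import json
from pathlib import Path

def save_evaluation_data(evaluation_data, save_path):
    """Sketch: write evaluation samples as JSON to save_path, or do
    nothing when save_path is None (null in a YAML config)."""
    if save_path is None:
        return None
    out = Path(save_path).with_suffix(".json")
    out.parent.mkdir(parents=True, exist_ok=True)  # e.g. "results/"
    out.write_text(json.dumps(evaluation_data, indent=2))
    return out
```

With `save_path="results/eval_output"`, this writes `results/eval_output.json`.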
- nemo_rl.evals.eval._print_results(
- master_config,
- generation_config,
- score,
- dataset_size,
- metric,
- k_value,
- num_tests_per_prompt,
)#
Print evaluation results.