nemo_rl.evals.eval#

Module Contents#

Classes#

EvalConfig

_PassThroughMathConfig

MasterConfig

Functions#

setup

Set up components for model evaluation.

eval_pass_k

Evaluate pass@k score using an unbiased estimator.

eval_cons_k

Evaluate cons@k score using an unbiased estimator.

run_env_eval

Main entry point for running evaluation using environment.

_run_env_eval_impl

Unified implementation for both sync and async evaluation.

_generate_texts

Generate texts using either sync or async method.

_save_evaluation_data_to_json

Save evaluation data to a JSON file.

_print_results

Print evaluation results.

API#

class nemo_rl.evals.eval.EvalConfig#

Bases: typing.TypedDict

metric: str#

None

num_tests_per_prompt: int#

None

seed: int#

None

k_value: int#

None

save_path: str | None#

None
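
Since EvalConfig is a TypedDict, a configuration can be written as a plain dict. A minimal sketch with illustrative values (the metric name and save path below are assumptions for illustration, not library defaults):

```python
# Illustrative EvalConfig dict; values are examples, not library defaults.
eval_config = {
    "metric": "pass@k",                  # which score to compute (assumed name)
    "num_tests_per_prompt": 8,           # responses sampled per prompt
    "seed": 42,                          # RNG seed for reproducible sampling
    "k_value": 4,                        # the k in pass@k / cons@k
    "save_path": "results/eval_output",  # or None to disable saving
}
```

Note that k_value should not exceed num_tests_per_prompt, since each pass@k or cons@k estimate draws k samples from the num_tests_per_prompt generations.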

class nemo_rl.evals.eval._PassThroughMathConfig#

Bases: typing.TypedDict

math: nemo_rl.environments.math_environment.MathEnvConfig#

None

class nemo_rl.evals.eval.MasterConfig#

Bases: typing.TypedDict

eval: nemo_rl.evals.eval.EvalConfig#

None

generation: nemo_rl.models.generation.interfaces.GenerationConfig#

None

tokenizer: nemo_rl.models.policy.TokenizerConfig#

None

data: nemo_rl.data.EvalDataConfigType#

None

env: nemo_rl.evals.eval._PassThroughMathConfig#

None

cluster: nemo_rl.distributed.virtual_cluster.ClusterConfig#

None

nemo_rl.evals.eval.setup(
master_config: nemo_rl.evals.eval.MasterConfig,
tokenizer: transformers.AutoTokenizer,
dataset: nemo_rl.data.datasets.AllTaskProcessedDataset,
) → tuple[nemo_rl.models.generation.vllm.VllmGeneration, torch.utils.data.DataLoader, nemo_rl.evals.eval.MasterConfig]#

Set up components for model evaluation.

Initializes the VLLM model and data loader.

Parameters:
  • master_config – Configuration settings.

  • tokenizer – Tokenizer used to encode prompts.

  • dataset – Dataset to evaluate on.

Returns:

VLLM model, data loader, and config.

nemo_rl.evals.eval.eval_pass_k(
rewards: torch.Tensor,
num_tests_per_prompt: int,
k: int,
) → float#

Evaluate pass@k score using an unbiased estimator.

Reference: https://github.com/huggingface/evaluate/blob/32546aafec25cdc2a5d7dd9f941fc5be56ba122f/metrics/code_eval/code_eval.py#L198-L213

Parameters:
  • rewards – Tensor of shape (batch_size * num_tests_per_prompt)

  • num_tests_per_prompt – Number of responses sampled per prompt.

  • k – int (pass@k value)

Returns:

pass_k_score

Return type:

float
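
The unbiased estimator referenced above (from the HuggingFace code_eval metric) computes, for n samples of which c are correct, the probability that at least one of k drawn samples is correct: 1 - C(n - c, k) / C(n, k). A minimal per-prompt sketch of that computation (the function name is illustrative; eval_pass_k applies this across the rewards tensor):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one prompt: n total samples, c correct.

    Equals 1 - C(n - c, k) / C(n, k), computed stably as a running product.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a success.
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

For example, with 10 samples and 3 correct, pass@1 reduces to the empirical accuracy c/n = 0.3, while pass@k for larger k approaches 1 as more draws get a chance to hit a correct sample.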

nemo_rl.evals.eval.eval_cons_k(
rewards: torch.Tensor,
num_tests_per_prompt: int,
k: int,
extracted_answers: list[str | None],
) → float#

Evaluate cons@k score using an unbiased estimator.

Parameters:
  • rewards – Tensor of shape (batch_size * num_tests_per_prompt)

  • num_tests_per_prompt – Number of responses sampled per prompt.

  • k – Number of samples drawn for the consensus estimate (cons@k value).

  • extracted_answers – list[str | None] of answers extracted from each response.

Returns:

cons_k_score

Return type:

float
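
cons@k scores majority voting: among k sampled responses, take the most common extracted answer and check whether it is correct. A simplified per-prompt sketch of that voting step (the helper name is hypothetical, and the actual eval_cons_k additionally averages this over k-subsets of the num_tests_per_prompt samples via its unbiased estimator):

```python
from collections import Counter

def majority_vote_score(extracted_answers, rewards):
    """Score of the most common non-None answer among the given samples.

    extracted_answers: list[str | None], one per sampled response.
    rewards: parallel list of 0/1 correctness scores.
    """
    votes = Counter(a for a in extracted_answers if a is not None)
    if not votes:
        return 0.0  # no parseable answers: consensus fails
    majority, _ = votes.most_common(1)[0]
    # Return the reward of any sample carrying the majority answer.
    for answer, reward in zip(extracted_answers, rewards):
        if answer == majority:
            return float(reward)
    return 0.0
```

Samples whose answer could not be extracted (None) simply abstain from the vote, so a prompt where no answer parses scores 0.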

nemo_rl.evals.eval.run_env_eval(vllm_generation, dataloader, env, master_config)#

Main entry point for running evaluation using environment.

Generates model responses and scores them with the environment.

Parameters:
  • vllm_generation – Model for generating responses.

  • dataloader – Data loader with evaluation samples.

  • env – Environment that scores responses.

  • master_config – Configuration settings.

async nemo_rl.evals.eval._run_env_eval_impl(
vllm_generation,
dataloader,
env,
master_config,
use_async=False,
)#

Unified implementation for both sync and async evaluation.

async nemo_rl.evals.eval._generate_texts(vllm_generation, inputs, use_async)#

Generate texts using either sync or async method.

nemo_rl.evals.eval._save_evaluation_data_to_json(
evaluation_data,
master_config,
save_path,
)#

Save evaluation data to a JSON file.

Parameters:
  • evaluation_data – List of evaluation samples

  • master_config – Configuration dictionary

  • save_path – Path to save evaluation results. Set to null to disable saving. Example: "results/eval_output" or "/path/to/evaluation_results".

nemo_rl.evals.eval._print_results(
master_config,
generation_config,
score,
dataset_size,
metric,
k_value,
num_tests_per_prompt,
)#

Print evaluation results.