Move topk file #5240


Merged
Aaron1011 merged 81 commits into main from alan/evals-split-file
Dec 18, 2025

Changes from 1 commit
81 commits
49a294e
Initial partial implementation of betting-based confidence sequences
amishlerDec 10, 2025
81d4ad6
Fix bets to use previous time step's variance estimate
amishlerDec 10, 2025
ace5568
Add updates to wealth processes
amishlerDec 11, 2025
c4f4a56
Add last step of confidence sequence calculation from hedged wealth p…
amishlerDec 11, 2025
e350c5e
Change product to log-sum-exp for numerical stability
amishlerDec 11, 2025
2290c1d
Add computation of point estimator for the mean
amishlerDec 11, 2025
ba8f9fc
Combine weighting and hedged wealth process computation
amishlerDec 11, 2025
fbf8cd9
Move bet calculation to helper function, add bet truncation
amishlerDec 11, 2025
2e2e85d
Tweak comment string for clarity
amishlerDec 11, 2025
103f835
Add module-level and type-level docstrings
amishlerDec 11, 2025
35465e8
Remove max vs combo hedging choice, explain use of max in docstring
amishlerDec 11, 2025
4400783
For computing mean estimate, restrict search to inside confidence int…
amishlerDec 11, 2025
61705c0
Add unit tests, fix search for interval endpoints
amishlerDec 11, 2025
f262d08
Tweak test and test comments
amishlerDec 11, 2025
8d9f55b
Fix regularized mean value, add regression tests with known values
amishlerDec 11, 2025
a5ffaa0
Move test, clarify comment strings in another test
amishlerDec 11, 2025
8e7561b
Add test with known confidence sequence values
amishlerDec 11, 2025
d7781d4
Change tests to use non-constant observations for variance fluctuations
amishlerDec 11, 2025
79763d3
Merge branch 'main' into alan/betting-confidence-sequences
amishlerDec 11, 2025
a8ba087
Rename module to contrast with asymptotic_confidence_sequences.rs
amishlerDec 11, 2025
1e00912
Add input validation and associated tests
amishlerDec 11, 2025
45b13ad
Remove unnecessary iterator method for m-values
amishlerDec 11, 2025
b4a1c37
Make return type a Result, use anyhow for errors
amishlerDec 11, 2025
523722c
Create enums and structs
amishlerDec 11, 2025
1bf28c9
Add check_topk_stopping(), allow dead code as needed
amishlerDec 11, 2025
86e195e
Add epsilon argument, changed from allow to expect dead_code
amishlerDec 11, 2025
6553b59
Add wip note to docstring
amishlerDec 11, 2025
6210fe3
Add tests using non-zero epsilon tolerance
amishlerDec 11, 2025
6865c10
Add more tests with k_min not equal to k_max
amishlerDec 12, 2025
893f731
Add docstrings to tests
amishlerDec 12, 2025
d92a7ac
Change return value when k_min > num_variants
amishlerDec 12, 2025
1d78a4b
Tweak docstrings
amishlerDec 12, 2025
bf848a1
Remove enum that will be included in a separate PR
amishlerDec 12, 2025
c7860a6
Add VariantStatus enum
amishlerDec 12, 2025
8a2b205
Refactor evaluations to accept a vector of variants for batch processing
amishlerDec 12, 2025
e602ebe
Remove changes from topk.rs for now
amishlerDec 12, 2025
b313209
Use partition_point() instead of binary search for finding confidence…
amishlerDec 12, 2025
1ccb1ba
Make enum for specifying grid of points where wealth processes are ca…
amishlerDec 12, 2025
c62b891
Move argument validation for WealthProcessGridPoints to new construct…
amishlerDec 12, 2025
6868cd4
Fix bug where variant could count itself when checking how many varia…
amishlerDec 12, 2025
69faf6e
Add variant_names option to CLI and python client
amishlerDec 12, 2025
54fc560
Merge branch 'alan/betting-confidence-sequences' into alan/topk-enums…
amishlerDec 12, 2025
e1aeddc
Fix tests to use new WealthProcessGridPoints enum
amishlerDec 12, 2025
f7187e0
Don't multiply num_datapoints by num_variants
amishlerDec 12, 2025
3790058
Merge branch 'alan/topk-enums-structs-stopping' into alan/evals-batch…
amishlerDec 12, 2025
ab3f5fd
Update some tests to use variant_names instead of variant_name
amishlerDec 12, 2025
89ff201
Merge branch 'alan/evals-batch-variants' of github.com:tensorzero/ten…
amishlerDec 12, 2025
726346e
Merge branch 'main' into alan/evals-batch-variants
amishlerDec 12, 2025
644cfa4
Add deprecation warning for variant_name argument
amishlerDec 12, 2025
822ac6e
Add deprecation warning for variant_name arg to python client
amishlerDec 12, 2025
02bd19e
Merge branch 'alan/evals-batch-variants' of github.com:tensorzero/ten…
amishlerDec 12, 2025
0c28a29
Factor out batch evals functionality into helper function
amishlerDec 15, 2025
c0b0517
Revert change to allow passing multiple variants to run_evaluations()
amishlerDec 15, 2025
a5af109
Merge branch 'main' into alan/evals-batch-variants
amishlerDec 15, 2025
e5b3727
Add variant back into tracing span
amishlerDec 15, 2025
28691f9
Merge branch 'alan/evals-batch-variants' of github.com:tensorzero/ten…
amishlerDec 15, 2025
9252e61
Add initial top-k orchestrator types and compute_updates() placeholder
amishlerDec 15, 2025
d9b165b
Remove redundancy from top-k batch processing
amishlerDec 15, 2025
3245107
Remove some types that will be used in run_topk() for a future PR
amishlerDec 15, 2025
b31c6d2
Replace fully qualified paths with imports
amishlerDec 15, 2025
9942a02
Clarify scoring function trait, change compute_updates() to use exter…
amishlerDec 15, 2025
a668928
Change return type of process_topk_batch() to minimize computation re…
amishlerDec 15, 2025
5ab72ce
Change function signature for process_topk_batch in anticipation that…
amishlerDec 15, 2025
ffb9c1d
Correct arg name in docstring
amishlerDec 15, 2025
67a0897
Add debug checks that scores are bounded in [0, 1]
amishlerDec 15, 2025
c1b7cbb
Tweak docstring
amishlerDec 15, 2025
539e26e
Remove batch processing, organize types into sections
amishlerDec 15, 2025
051bacd
Remove TopKContext and extra imports
amishlerDec 15, 2025
b5e7085
Add remaining core types
amishlerDec 15, 2025
b070651
Add unit tests for compute_updates with non-empty evaluation results
amishlerDec 15, 2025
6f4a913
Merge branch 'main' into alan/evals-topk-batch-processing
amishlerDec 15, 2025
cb73452
Swap order of enum and struct
amishlerDec 15, 2025
701b2f9
Rename AdaptiveEvalStoppingResults to TopKTaskOutput
amishlerDec 15, 2025
c8b0cba
Expand unit tests for compute_updates(), move to top of tests mod
amishlerDec 16, 2025
249f6e6
Throw error if a task gets cancelled
amishlerDec 16, 2025
5c92bce
Add top-k tie-breaking logic and related unit tests
amishlerDec 16, 2025
b73b4e0
Create struct to replace long tuple
amishlerDec 17, 2025
93fa478
Move topk file
amishlerDec 17, 2025
792c075
Rename topk file
amishlerDec 17, 2025
fe1579e
Add module back to lib.rs
amishlerDec 17, 2025
ec5ea7d
Merge branch 'main' into alan/evals-split-file
amishlerDec 18, 2025
Add more tests with k_min not equal to k_max
amishler committed Dec 12, 2025
commit 6865c10190b62ea4fad5cc4fbea84157ddf9d8d4
171 changes: 128 additions & 43 deletions evaluations/src/topk.rs
@@ -12,28 +12,8 @@
 
 use std::collections::HashMap;
 
-use crate::EvaluationVariant;
 use crate::betting_confidence_sequences::MeanBettingConfidenceSequence;
 
-#[expect(dead_code)]
-const EVALUATOR_FAILURE_THRESHOLD: f32 = 0.05;
-#[expect(dead_code)]
-const VARIANT_FAILURE_THRESHOLD: f32 = 0.05;
-
-// Enum for variant status during an evals run.
-// Will be used in run() function, to be implemented.
-#[expect(dead_code)]
-enum VariantStatus {
-    // Still running evals on this variant
-    Active,
-    // Not running evals; variant is confidently within top k_min
-    Include,
-    // Not running evals; variant is confidently outside the top k_max
-    Exclude,
-    // Not running evals; variant failure rate is confidently >= VARIANT_FAILURE_THRESHOLD
-    Failed,
-}
-
 // Enum for global stopping condition.
 // In case multiple stopping conditions are satisfied simultaneously,
 // the highest ranked condition takes precedence. The order of the last three is fairly arbitrary.
@@ -52,29 +32,6 @@ pub enum GlobalStoppingReason {
     TooManyVariantsFailed(Vec<String>),
 }
 
-// Arguments to main run() function, to be implemented
-#[expect(dead_code)]
-pub struct TopKVariantArgs {
-    evaluation_name: String,
-    variant_list: Vec<EvaluationVariant>,
-    dataset_name: String,
-    k_min: u32,
-    k_max: u32,
-    max_datapoints: u64,
-    epsilon: f32,
-    alpha_performance: f32,
-    alpha_failure: f32,
-    batch_size: u32,
-}
-
-// Struct for the output of the run() function, to be implemented
-pub struct AdaptiveEvalStoppingResults {
-    pub variant_performance: Vec<MeanBettingConfidenceSequence>,
-    pub variant_failure_rates: Vec<MeanBettingConfidenceSequence>,
-    pub evaluator_failure_rates: Vec<MeanBettingConfidenceSequence>,
-    pub stopping_reason: GlobalStoppingReason,
-}
-
 /// Result of checking the top-k stopping condition.
 #[derive(Debug, Clone)]
 pub struct TopKStoppingResult {
@@ -495,4 +452,132 @@ mod tests {
         assert_eq!(result.k, Some(3));
         assert_eq!(result.top_variants.len(), 3);
     }
+
+    #[test]
+    fn test_check_topk_stopping_returns_largest_viable_k() {
+        // Variant A: [0.8, 0.95] - beats all 4 others
+        // Variant B: [0.7, 0.85] - beats C, D, E (3 others)
+        // Variant C: [0.5, 0.65] - beats D, E (2 others)
+        // Variant D: [0.3, 0.45] - beats E (1 other)
+        // Variant E: [0.1, 0.25] - beats none
+        //
+        // For top-1: need to beat >= 4. Only A qualifies.
+        // For top-2: need to beat >= 3. A and B qualify.
+        // For top-3: need to beat >= 2. A, B, C qualify.
+        // For top-4: need to beat >= 1. A, B, C, D qualify.
+        // For top-5: need to beat >= 0. All qualify.
+        let variant_performance: HashMap<String, MeanBettingConfidenceSequence> = [
+            mock_cs("a", 0.8, 0.95),
+            mock_cs("b", 0.7, 0.85),
+            mock_cs("c", 0.5, 0.65),
+            mock_cs("d", 0.3, 0.45),
+            mock_cs("e", 0.1, 0.25),
+        ]
+        .into_iter()
+        .collect();
+
+        // k_min=1, k_max=5: should return k=5 (largest viable)
+        let result = check_topk_stopping(&variant_performance, 1, 5, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(5));
+        assert_eq!(result.top_variants.len(), 5);
+
+        // k_min=1, k_max=3: should return k=3 (largest viable within range)
+        let result = check_topk_stopping(&variant_performance, 1, 3, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(3));
+        assert_eq!(result.top_variants.len(), 3);
+        assert!(result.top_variants.contains(&"a".to_string()));
+        assert!(result.top_variants.contains(&"b".to_string()));
+        assert!(result.top_variants.contains(&"c".to_string()));
+
+        // k_min=1, k_max=2: should return k=2
+        let result = check_topk_stopping(&variant_performance, 1, 2, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(2));
+        assert_eq!(result.top_variants.len(), 2);
+        assert!(result.top_variants.contains(&"a".to_string()));
+        assert!(result.top_variants.contains(&"b".to_string()));
+
+        // k_min=2, k_max=4: should return k=4 (largest viable within range)
+        let result = check_topk_stopping(&variant_performance, 2, 4, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(4));
+        assert_eq!(result.top_variants.len(), 4);
+    }
+
+    #[test]
+    fn test_check_topk_stopping_k_range_no_viable_k() {
+        // Variant A: [0.4, 0.7]
+        // Variant B: [0.35, 0.65]
+        // Variant C: [0.3, 0.6]
+        // All intervals overlap significantly - no one beats anyone
+        //
+        // For any k < 3, we need variants to beat others, but none do.
+        // Only k=3 works (need to beat >= 0).
+        let variant_performance: HashMap<String, MeanBettingConfidenceSequence> = [
+            mock_cs("a", 0.4, 0.7),
+            mock_cs("b", 0.35, 0.65),
+            mock_cs("c", 0.3, 0.6),
+        ]
+        .into_iter()
+        .collect();
+
+        // k_min=1, k_max=2: neither k=1 nor k=2 is viable, so no stopping
+        let result = check_topk_stopping(&variant_performance, 1, 2, None);
+        assert!(!result.stopped);
+        assert!(result.k.is_none());
+
+        // k_min=1, k_max=3: k=3 is viable (all variants beat >= 0 others)
+        let result = check_topk_stopping(&variant_performance, 1, 3, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(3));
+
+        // k_min=3, k_max=3: k=3 is viable
+        let result = check_topk_stopping(&variant_performance, 3, 3, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(3));
+    }
+
+    #[test]
+    fn test_check_topk_stopping_k_range_partial_viability() {
+        // Variant A: [0.7, 0.9] - beats B and C
+        // Variant B: [0.4, 0.6] - beats no one (C's upper 0.55 > B's lower 0.4)
+        // Variant C: [0.35, 0.55] - beats no one
+        //
+        // A beats 2 (both uppers < 0.7)
+        // B beats 0
+        // C beats 0
+        //
+        // For top-1: need >= 2. Only A qualifies. ✓
+        // For top-2: need >= 1. Only A qualifies (1 variant). ✗
+        // For top-3: need >= 0. All qualify. ✓
+        let variant_performance: HashMap<String, MeanBettingConfidenceSequence> = [
+            mock_cs("a", 0.7, 0.9),
+            mock_cs("b", 0.4, 0.6),
+            mock_cs("c", 0.35, 0.55),
+        ]
+        .into_iter()
+        .collect();
+
+        // k_min=1, k_max=3: k=3 is largest viable
+        let result = check_topk_stopping(&variant_performance, 1, 3, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(3));
+
+        // k_min=1, k_max=2: only k=1 is viable (k=2 fails)
+        let result = check_topk_stopping(&variant_performance, 1, 2, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(1));
+        assert_eq!(result.top_variants, vec!["a".to_string()]);
+
+        // k_min=2, k_max=2: k=2 is not viable, no stopping
+        let result = check_topk_stopping(&variant_performance, 2, 2, None);
+        assert!(!result.stopped);
+
+        // k_min=2, k_max=3: k=3 is viable
+        let result = check_topk_stopping(&variant_performance, 2, 3, None);
+        assert!(result.stopped);
+        assert_eq!(result.k, Some(3));
+    }
 }
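
Reading aid for the new tests: the comments describe a rule where a variant "beats" another when its lower confidence bound exceeds the other's upper bound, and a value of k is viable when exactly k variants each beat at least (n - k) others. The sketch below is one possible way to express that rule, not the implementation in topk.rs. `Interval` and `largest_viable_k` are hypothetical names introduced for illustration; the real `check_topk_stopping` takes `MeanBettingConfidenceSequence` values and an optional epsilon tolerance, both of which are omitted here.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a variant's confidence interval.
/// (The real code uses MeanBettingConfidenceSequence.)
#[derive(Debug, Clone, Copy)]
struct Interval {
    lower: f64,
    upper: f64,
}

/// Sketch of the rule described in the test comments, assuming k_min >= 1.
/// Returns the largest viable k in [k_min, k_max] together with the sorted
/// names of the top-k variants, or None if no k in the range is viable.
fn largest_viable_k(
    intervals: &HashMap<String, Interval>,
    k_min: usize,
    k_max: usize,
) -> Option<(usize, Vec<String>)> {
    let n = intervals.len();

    // For each variant, count how many *other* variants it confidently beats,
    // i.e. whose upper bound lies strictly below this variant's lower bound.
    let beats: Vec<(&String, usize)> = intervals
        .iter()
        .map(|(name, iv)| {
            let count = intervals
                .iter()
                .filter(|(other, o)| *other != name && iv.lower > o.upper)
                .count();
            (name, count)
        })
        .collect();

    // Try the largest k first; k is capped at n so that n - k cannot underflow.
    for k in (k_min..=k_max.min(n)).rev() {
        let needed = n - k;
        let mut top: Vec<String> = beats
            .iter()
            .filter(|(_, count)| *count >= needed)
            .map(|(name, _)| name.to_string())
            .collect();
        if top.len() == k {
            top.sort();
            return Some((k, top));
        }
    }
    None
}
```

With the intervals from `test_check_topk_stopping_k_range_partial_viability` (A = [0.7, 0.9], B = [0.4, 0.6], C = [0.35, 0.55]), this sketch reports k = 1 for the range 1..=2, k = 3 for the range 1..=3, and no viable k for the range 2..=2, which matches the expectations asserted in that test.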
