
Fix sample weight handling in SAG(A) #31675


Open
snath-xoc wants to merge 20 commits into scikit-learn:main from snath-xoc:sag_tests

Conversation

snath-xoc (Contributor):

Reference Issues/PRs

Fixes issue on sample weight handling within SAG(A): #31536.

What does this implement/fix? Explain your changes.

SAG(A) now accounts for sample weights by:

  • Applying weighted sampling of random indices when sample weights are not None. This should be equivalent to uniformly sampling from a repeated dataset (i.e., frequency-based weighting).
  • Calculating the step size by accounting for sample weights.

TO DO:

  • Apply the same sample weight corrections in get_auto_step_size
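The frequency-weighting equivalence claimed in the first bullet can be sanity-checked in isolation: drawing index i with probability s_i / S visits each row as often as uniform sampling over a dataset where row i is repeated s_i times. A minimal standalone sketch (illustrative only, not the PR's solver code):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 4
sample_weight = np.array([1, 0, 2, 3], dtype=float)

# Weighted sampling of indices: pick index i with probability s_i / S.
p = sample_weight / sample_weight.sum()
drawn = rng.choice(n_samples, size=100_000, p=p)

# Uniform sampling over the dataset with each row repeated s_i times.
repeated = np.repeat(np.arange(n_samples), sample_weight.astype(int))
drawn_rep = rng.choice(repeated, size=100_000)

# Both samplers visit index i with empirical frequency close to s_i / S.
freq = np.bincount(drawn, minlength=n_samples) / drawn.size
freq_rep = np.bincount(drawn_rep, minlength=n_samples) / drawn_rep.size
print(np.round(freq, 2), np.round(freq_rep, 2))
```

Note that rows with zero weight (index 1 above) are never visited by either sampler, matching the repeated-dataset semantics.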

github-actions bot commented Jun 28, 2025 (edited):

✔️ Linting Passed — all linting checks passed. Generated for commit 69b61b9.

snath-xoc (Contributor, Author) commented:

Running the test as follows:

```python
import numpy as np
from scipy.stats import kstest
from sklearn.linear_model.tests.test_sag import sag, squared_dloss, get_step_size
from sklearn.datasets import make_regression, make_classification
from sklearn.utils._testing import assert_allclose_dense_sparse

alpha = 1
n_features = 6
rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000, random_state=77, n_features=n_features)
weights = rng.randint(0, 5, size=X.shape[0])
X_repeated = np.repeat(X, weights, axis=0)
y_repeated = np.repeat(y, weights, axis=0)
weights_w_all = np.zeros([n_features, 50])
weights_r_all = np.zeros([n_features, 50])
step_size_w = get_step_size(X, alpha, True, sample_weight=weights)
step_size_r = get_step_size(X_repeated, alpha, True)

for random_state in np.arange(50):
    weights_w, int_w = sag(
        X, y, step_size=step_size_w, sample_weight=weights,
        alpha=alpha, dloss=squared_dloss, random_state=random_state,
    )
    weights_w_all[:, random_state] = weights_w
    weights_r, int_r = sag(
        X_repeated, y_repeated, step_size=step_size_r,
        alpha=alpha, dloss=squared_dloss, random_state=random_state,
    )
    weights_r_all[:, random_state] = weights_r

print(kstest(weights_r_all[0], weights_w_all[0]))
```

now gives the result:

```
KstestResult(statistic=np.float64(0.2), pvalue=np.float64(0.2719135601522248), statistic_location=np.float64(-0.004336382594871251), statistic_sign=np.int8(1))
```


ogrisel (Member) commented Jul 1, 2025 (edited):

@snath-xoc at the meeting, you mentioned that the statistical test would not pass for some datasets. Could you please post an example and add a TODO item to the PR to investigate this problem?

Also, whenever the penalty is non-zero, the problem is strictly convex and the solution should be unique. So it should be possible to write deterministic tests (with various random seed values) instead of statistical tests to:

  • craft a minimal reproducer that does not involve running a KS-test;
  • check whether the proposed fix resolves the bug for all seeds in the [0, 99] range, for instance.

This might require setting tol to a low enough value, max_iter to a large enough value, and checking that no ConvergenceWarning is raised.
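One possible shape for such a deterministic weighted-vs-repeated test, sketched here with the default lbfgs solver (whose sample-weight handling is not affected by this bug) purely to illustrate the pattern; for saga one would loop random_state over the [0, 99] range. The dataset and tolerances are illustrative assumptions:

```python
import warnings
import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Fail loudly instead of silently returning a non-converged solution.
warnings.filterwarnings("error", category=ConvergenceWarning)

X, y = make_classification(n_samples=200, n_features=6, random_state=77)
rng = np.random.RandomState(0)
sw = rng.randint(1, 5, size=X.shape[0])  # integer weights so repeating is exact
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw)

params = dict(tol=1e-10, max_iter=10_000)
coef_w = LogisticRegression(**params).fit(X, y, sample_weight=sw).coef_
coef_r = LogisticRegression(**params).fit(X_rep, y_rep).coef_

# The penalized problem is strictly convex, so both fits must reach the
# same unique minimizer up to the solver tolerance.
np.testing.assert_allclose(coef_w, coef_r, rtol=1e-4, atol=1e-6)
```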

Comment on lines -99 to 106

```python
if sample_weight is not None:
    gradient *= sample_weight[idx]

update = entry * gradient + alpha * weights
```
antoinebaker (Contributor):

You should keep the gradient multiplied by the sample_weight even if you change the sampling of idx; otherwise you are not computing the correct loss gradient.

antoinebaker (Contributor) commented Jul 2, 2025 (edited):

The loss with $\ell_2$ regularizer to minimize wrt $w$ = weights coefficients of the linear model (skipping the intercept for now):

$loss(w) = \frac{1}{S} \sum_i s_i l(y_i, w^Tx_i) + \frac{\alpha}{2} \Vert w \Vert^2$

can be put in several ways as the finite sum minimized by sag(a):

$g(w) = \frac{1}{n} \sum_i f_i(w) \propto loss(w)$

Let $l_i(w) = l(y_i, w^Tx_i)$ be the data $i$ loss term. We can choose:

  1. $f_i(w) = s_i l_i(w) + s_i \frac{\alpha}{2} \Vert w \Vert^2$. The gradient updated and stored in memory is then update = sample_weight[idx] * (entry * gradient + alpha * weights).
  2. $f_i(w) = s_i l_i(w) + \frac{S}{n} \frac{\alpha}{2} \Vert w \Vert^2$. This gives the code in main, where only the data $i$ loss term is weighted.
  3. $f_i(w) = s_i l_i(w)$. The gradient updated and stored in memory is then update = sample_weight[idx] * entry * gradient, i.e. only the data $i$ loss term. The regularizer is then taken into account when doing the gradient descent step on weights, for example weights -= alpha * step_size * weights.

I feel 2. (the current code) is a bit weird; for instance, there is maybe a weird scaling of alpha as $\frac{S}{n}\alpha$. However, all three options lead to the same updated weights at the end, so it shouldn't matter too much.
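The aggregate agreement between the decompositions can be verified numerically for the squared loss: averaging $f'_i(w)$ over $i$ gives the same full gradient under options 1 and 2, and option 3 recovers it once the separately-handled regularizer gradient is added back (scaled as $\frac{S}{n}\alpha w$ here so all three minimize the same $g(w)$; that scaling is my reading, not solver code):

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, alpha = 20, 3, 0.5
X = rng.randn(n, d)
y = rng.randn(n)
s = rng.randint(1, 5, size=n).astype(float)  # sample weights, S = s.sum()
S = s.sum()
w = rng.randn(d)

# Per-sample squared-loss gradients l_i'(w) = (w.x_i - y_i) x_i
resid = X @ w - y
grad_i = resid[:, None] * X  # shape (n, d)

# Option 1: f_i'(w) = s_i * (l_i'(w) + alpha * w)
g1 = (s[:, None] * (grad_i + alpha * w)).mean(axis=0)
# Option 2: f_i'(w) = s_i * l_i'(w) + (S / n) * alpha * w
g2 = (s[:, None] * grad_i + (S / n) * alpha * w).mean(axis=0)
# Option 3: data terms only; regularizer gradient added back separately
g3 = (s[:, None] * grad_i).mean(axis=0) + (S / n) * alpha * w

print(np.allclose(g1, g2), np.allclose(g1, g3))
```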

antoinebaker (Contributor):

I think I have a slight preference for option 1., with the meaning of adding a term proportional to $s_i$, in which case it would make sense to redefine n_seen as the weighted sum of seen indices, which will converge to $S$ instead of $n$.

snath-xoc (Contributor, Author):

> You should keep the gradient multiplied by the sample_weight even if you change the sampling of idx; otherwise you are not computing the correct loss gradient.

I am not entirely sure about this: in the case of sampling with frequency-weighted probabilities, is this not equivalent to uniformly sampling from the repeated dataset, in which case we can maintain the update step as is? Not sure if I have misunderstood something here.

In the above case, it seems like we are considering weights to be more of the form of precision weights, no?

antoinebaker (Contributor) commented Jul 7, 2025 (edited):

I think the confusion comes from the comparison with SGD.

Let's step back. For the weighted version, consider a function $f(w) = \frac{1}{S} \sum_i s_i f_i(w)$ to minimize.

The (FG) full gradient step is $w_{k+1} = w_k - \gamma f'(w_k)$, where the correct gradient is $f'(w_k) = \frac{1}{S} \sum_i s_i f'_i(w_k)$.

SG (stochastic gradient) and SAG (stochastic average gradient) are two stochastic methods to approximate the FG, but in different ways.

SG: the descent step uses the gradient at a randomly selected data point. In the unweighted case, you must sample the data points uniformly. Otherwise the gradient (averaged over some iterations) would be biased and your descent does not converge toward the right minima. In the weighted case you therefore have two choices: either you sample $i$ uniformly and multiply the gradient by the sample weight

$w_{k+1} = w_k - \gamma s_i f'_i(w_k)$

or you sample $j$ proportionally to the sample weights but do not multiply the gradient:

$w_{k+1} = w_k - \gamma f'_j(w_k)$

Notice that, in expectation, and averaged over some iterations, the gradient approximates the FG $f'(w) = \frac{1}{S} \sum_i s_i f'_i(w)$. If you were to use the sample weight two times, you would get a biased estimate of the FG.

SAG(A): the descent step uses an approximation of the average gradient $f'(w) = \frac{1}{S} \sum_i s_i f'_i(w)$. The trick is to store individual gradients in memory, denoted $f'_i(\phi_i^k)$ or $y_i^k$ in the sag(a) papers. The individual gradient is updated for the randomly selected data point $i$. The descent step always uses the approximate average gradient $\frac{1}{S} \sum_i s_i f'_i(\phi_i^k) \sim f'(w)$. So here you directly apply an approximation of the FG at each iteration, which is not biased, whatever the sampling used for the data point selection. The sampling only affects which part of the average gradient is most regularly updated.

This is a crucial difference with SG. In the unweighted case, for example, to quote the sag paper:

> In standard SG methods, it is crucial to sample the functions fi uniformly, at least asymptotically, in order to yield an unbiased gradient estimate and subsequently achieve convergence to the optimal value (alternately, the bias induced by non-uniform sampling would need to be asymptotically corrected). In SAG iterations, however, the weight of each gradient is constant and equal to 1/n, regardless of the frequency at which the corresponding function is sampled. We might thus consider sampling the functions fi non-uniformly, without needing to correct for this bias.

So then a question is which sampling procedure to use? See above and the sag(a) paper, but I think it involves looking at the Lipschitz smoothness constants.
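The sampling-independence of SAG described above can be illustrated on a toy weighted ridge-regularized least-squares problem: the same stored weighted-average gradient drives the step whether indices are drawn uniformly or proportionally to the weights, and both runs reach the unique minimizer. All constants here are illustrative, and the step size is a rough 1/L guess rather than the paper's recommendation:

```python
import numpy as np

rng = np.random.RandomState(0)
n, d, alpha = 40, 3, 1.0
X = rng.randn(n, d)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(n)
s = rng.randint(1, 4, size=n).astype(float)  # sample weights
S = s.sum()

# Unique minimizer of (1/S) sum_i s_i (x_i.w - y_i)^2 / 2 + (alpha/2) ||w||^2
w_star = np.linalg.solve((X.T * s) @ X / S + alpha * np.eye(d), (X.T * s) @ y / S)

def weighted_sag(p, n_iter=100_000):
    # Rough 1/L step: L_i ~ (n s_i / S) ||x_i||^2 for the data terms, plus alpha.
    step = 1.0 / (np.max(n * s / S * (X ** 2).sum(axis=1)) + alpha)
    w = np.zeros(d)
    mem = np.zeros((n, d))  # stored per-sample gradients f_i'(phi_i)
    avg = np.zeros(d)       # running (1/S) sum_i s_i * mem[i]
    for i in rng.choice(n, size=n_iter, p=p):
        g = (X[i] @ w - y[i]) * X[i]      # current l_i'(w)
        avg += s[i] * (g - mem[i]) / S
        mem[i] = g
        w -= step * (avg + alpha * w)     # regularizer handled in the step
    return w

for p in (np.full(n, 1.0 / n), s / S):    # uniform vs weight-proportional
    print(np.linalg.norm(weighted_sag(p) - w_star) < 1e-3)
```

Both sampling schemes converge to w_star; they differ only in how often each stored gradient mem[i] is refreshed.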

Comment on lines +247 to +252

```python
def get_step_size(X, alpha, fit_intercept, classification=True, sample_weight=None):
    if sample_weight is None:
        X_prod = np.sum(X * X, axis=1)
    else:
        X = X[sample_weight != 0]
        X_prod = np.sum(X * X, axis=1)
```
antoinebaker (Contributor):

I don't think removing the zero sample_weight entries is enough.

If I understood the sag(a) paper correctly, the recommended step size is step_size = 1 / L where $L = \max_i L_i$ and $L_i$ is the Lipschitz smoothness constant of $f_i(w)$.

It will depend on how we decompose the weighted sum into individual $f_i(w)$ terms. With the options above:

  1. $L_i = s_i \kappa \Vert x_i \Vert^2 + s_i \alpha$
  2. $L_i = s_i \kappa \Vert x_i \Vert^2 + \frac{S}{n} \alpha$

where $\kappa = 1$ for regression and $\kappa = 1/4$ for classification.
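The two candidate step sizes can be written down directly from these $L_i$ formulas. The helper below is hypothetical (the name is made up, and the fit_intercept correction of the real get_step_size is omitted for brevity):

```python
import numpy as np

def get_weighted_step_sizes(X, sample_weight, alpha, classification=True):
    """Candidate step sizes 1 / max_i L_i for decompositions 1. and 2. above."""
    kappa = 0.25 if classification else 1.0
    n = X.shape[0]
    S = sample_weight.sum()
    norms = (X ** 2).sum(axis=1)                          # ||x_i||^2
    L1 = sample_weight * (kappa * norms + alpha)          # option 1
    L2 = sample_weight * kappa * norms + (S / n) * alpha  # option 2
    return 1.0 / L1.max(), 1.0 / L2.max()

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
sw = rng.randint(0, 5, size=100).astype(float)
print(get_weighted_step_sizes(X, sw, alpha=1.0))
```

Note that zero-weight rows contribute $L_i = 0$ under option 1 but a non-zero $\frac{S}{n}\alpha$ under option 2, so the two options can pick different limiting samples.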

antoinebaker (Contributor):

In the end, it depends on how we allocate the regularizer:

  • evenly wrt the indices: each $i$ has the same regularizer $\frac{S}{n} \frac{\alpha}{2}\Vert w \Vert^2$;
  • evenly wrt the sample_weight: each $i$ has a regularizer $s_i \frac{\alpha}{2}\Vert w \Vert^2$ proportional to its sample_weight.

Maybe we should try both strategies and see which one leads to better convergence?

antoinebaker (Contributor):

I think we first need to figure out the step size computation before trying non-uniform sampling, for example by following section 5.5, Effect of non-uniform sampling, of the sag paper.

snath-xoc (Contributor, Author) commented Jul 4, 2025 (edited):

Hmmmm so I guess you are suggesting a variable step size only for the special case of sample weights?

antoinebaker (Contributor) commented Jul 7, 2025 (edited):

Sorry if I wasn't clear; I meant first find/test the correct formula for step_size with sample_weight (see options 1 and 2 above).

Later we can try non-uniform sampling, which I think would be more difficult than just sampling proportionally to the sample_weight; see the discussion in section 5.5 around the Lipschitz constants.

ogrisel (Member) commented Jul 4, 2025 (edited):

Regarding #31675 (comment):

  • I indeed observe that fitting LogisticRegression(C=1e6, max_iter=10_000, solver='saga', tol=1e-10) on a noisy target and with large sample_weight often does not converge (ConvergenceWarning).
  • The reference Python implementation in this PR does not check for convergence, hence does not raise the warning, but I suspect it does not converge either.

I think we should start by adding a convergence check to the reference Python code, similar to what is done for the main Cython implementation.
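One possible shape for such a check, as a standalone sketch (the helper name is made up, and the relative max-change criterion is illustrative; the real Cython solver's exact criterion should be matched when porting):

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning

def check_convergence(weights, previous_weights, tol, it, max_iter):
    """Return True once the relative max coefficient change drops below tol.

    Warn with ConvergenceWarning if the iteration budget runs out first.
    """
    max_change = np.max(np.abs(weights - previous_weights))
    max_weight = np.max(np.abs(weights))
    if max_weight != 0 and max_change / max_weight <= tol:
        return True
    if it == max_iter - 1:
        warnings.warn(
            "The max_iter was reached which means the coef_ did not converge",
            ConvergenceWarning,
        )
    return False
```

The reference sag loop would call this once per epoch with the previous epoch's coefficients and stop early on True.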

EDIT: here is a minimal reproducer for the convergence bug:

```python
# %%
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.exceptions import ConvergenceWarning
import warnings

warnings.filterwarnings("error", category=ConvergenceWarning)

rng = np.random.RandomState(42)
X = rng.randn(1000, 10)
y = rng.randint(0, 2, size=X.shape[0])

# Use sample weights to trigger the convergence warning
# sample_weight = rng.randint(0, 10, size=X.shape[0]).astype(np.float64)
sample_weight = np.full(shape=X.shape[0], fill_value=5, dtype=np.float64)

# Uncomment any of the two following lines and the convergence problems go away.
# sample_weight /= sample_weight.max()  # reduce magnitude
# sample_weight = None

common_params = {
    "tol": 1e-12,
    "max_iter": 10_000,
    "C": 1e10,
    "random_state": 42,
}

# %%
for solver in ["lbfgs", "saga"]:
    lr = LogisticRegression(solver=solver, **common_params).fit(
        X, y, sample_weight=sample_weight
    )
    print(lr.solver, lr.n_iter_[0])
```

```
lbfgs 7  # no convergence problem with LBFGS
Traceback (most recent call last):
  Cell In[37], line 3
    lr = LogisticRegression(solver=solver, **common_params).fit(X, y, sample_weight=sample_weight)
  File ~/code/scikit-learn/sklearn/base.py:1365 in wrapper
    return fit_method(estimator, *args, **kwargs)
  File ~/code/scikit-learn/sklearn/linear_model/_logistic.py:1384 in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
  File ~/code/scikit-learn/sklearn/utils/parallel.py:82 in __call__
    return super().__call__(iterable_with_config_and_warning_filters)
  File ~/miniforge3/envs/dev/lib/python3.13/site-packages/joblib/parallel.py:1986 in __call__
    return output if self.return_generator else list(output)
  File ~/miniforge3/envs/dev/lib/python3.13/site-packages/joblib/parallel.py:1914 in _get_sequential_output
    res = func(*args, **kwargs)
  File ~/code/scikit-learn/sklearn/utils/parallel.py:147 in __call__
    return self.function(*args, **kwargs)
  File ~/code/scikit-learn/sklearn/linear_model/_logistic.py:560 in _logistic_regression_path
    w0, n_iter_i, warm_start_sag = sag_solver(
  File ~/code/scikit-learn/sklearn/linear_model/_sag.py:348 in sag_solver
    warnings.warn(
ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
```

EDIT: the weights can be constant and the problem is still present as long as their value is large enough.

snath-xoc (Contributor, Author) commented:

```python
import numpy as np
from scipy.stats import kstest
from sklearn.linear_model.tests.test_sag import sag, squared_dloss, get_step_size
from sklearn.datasets import make_regression, make_classification
from sklearn.utils._testing import assert_allclose_dense_sparse

alpha = 1
n_features = 6
rng = np.random.RandomState(0)
X, y = make_classification(
    n_samples=1000, random_state=77, n_features=n_features,
    n_classes=5, n_informative=4,
)
weights = rng.randint(0, 5, size=X.shape[0])
X_repeated = np.repeat(X, weights, axis=0)
y_repeated = np.repeat(y, weights, axis=0)
weights_w_all = np.zeros([n_features, 50])
weights_r_all = np.zeros([n_features, 50])
step_size_w = get_step_size(X, alpha, True, sample_weight=weights)
step_size_r = get_step_size(X_repeated, alpha, True)

for random_state in np.arange(50):
    weights_w, int_w = sag(
        X, y, step_size=step_size_w, sample_weight=weights,
        alpha=alpha, dloss=squared_dloss, random_state=random_state,
    )
    weights_w_all[:, random_state] = weights_w
    weights_r, int_r = sag(
        X_repeated, y_repeated, step_size=step_size_r,
        alpha=alpha, dloss=squared_dloss, random_state=random_state,
    )
    weights_r_all[:, random_state] = weights_r

print(kstest(weights_r_all[0], weights_w_all[0]))
```

This gives the result:

```
KstestResult(statistic=np.float64(0.3), pvalue=np.float64(0.02170784069014051), statistic_location=np.float64(-0.16524780309259282), statistic_sign=np.int8(-1))
```

Although it still isn't too bad, I do think, as @antoinebaker pointed out, there are still some things that aren't completely correct.

Reviewers: antoinebaker (left review comments)

3 participants: @snath-xoc, @ogrisel, @antoinebaker
