ENH add ARPACK solver to IncrementalPCA to avoid densifying sparse data #29512

Open

Charlie-XIAO wants to merge 23 commits into scikit-learn:main from Charlie-XIAO:fast-ipca-sparse

Conversation

@Charlie-XIAO (Contributor) commented Jul 17, 2024 · edited

Fixes #28386, motivated by #18689.

  • Use the _implicit_column_offset operator as implemented in #18689.
  • Add an svd_solver parameter supporting "full" (default, original behavior) and "arpack" (truncated SVD).
  • Implement an _implicit_vstack operator to avoid densifying data in intermediate steps (see the sketch after this list).
  • Add tests for _implicit_vstack.
  • Add tests for IncrementalPCA with svd_solver="arpack".
  • Test the performance improvement on the fetch_20newsgroups_vectorized dataset and update the changelog.
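
For intuition about how such implicit operators avoid densification, here is a minimal sketch (not the PR's actual implementation) of the vstack idea: a scipy LinearOperator that behaves like np.vstack([A, B]) but only ever applies A and B through matvec/rmatvec, so scipy.sparse.linalg.svds (ARPACK) can factor the stacked matrix without materializing it.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, aslinearoperator, svds


def implicit_vstack(A, B):
    """Behave like np.vstack([A, B]) without building a dense array."""
    A, B = aslinearoperator(A), aslinearoperator(B)
    m1, n = A.shape
    m2, _ = B.shape

    def matvec(x):
        # (A; B) @ x is just the two products stacked
        return np.concatenate([A.matvec(x), B.matvec(x)])

    def rmatvec(y):
        # (A; B).T @ y splits y into the two row blocks
        return A.rmatvec(y[:m1]) + B.rmatvec(y[m1:])

    return LinearOperator((m1 + m2, n), matvec=matvec, rmatvec=rmatvec)


# Hypothetical usage: stack a small dense block (e.g. data reconstructed from the
# previous components) on top of a new sparse batch, then run truncated SVD.
rng = np.random.default_rng(0)
top = rng.standard_normal((5, 40))
batch = sp.random(200, 40, density=0.05, random_state=0, format="csr")
U, S, Vt = svds(implicit_vstack(top, batch), k=3)
```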

Enhancement Overview

The following code uses the first 500 entries from the 20 newsgroups training set, of shape (500, 130107). When both use truncated SVD via ARPACK, the sparse routine is ~3x faster and uses >30x less memory than the dense routine. Compared with the dense routine using full SVD (the original setup), it is ~10x faster.

Example code

```python
import time
import tracemalloc

import numpy as np
import pandas as pd

from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import IncrementalPCA


def measure_performance(func, *args, **kwargs):
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"time_s": elapsed, "peak_mb": peak / (1024**2)}, result


def sparse_ipca_arpack(X):
    ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    coords = ipca.fit_transform(X)
    return ipca, coords


def dense_ipca_arpack(X):
    X_dense = X.toarray()
    ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
    coords = ipca.fit_transform(X_dense)
    return ipca, coords


def dense_ipca_full(X):
    X_dense = X.toarray()
    ipca = IncrementalPCA(n_components=20, svd_solver="full")
    coords = ipca.fit_transform(X_dense)
    return ipca, coords


def main():
    n_samples = 3000
    X, _ = fetch_20newsgroups_vectorized(return_X_y=True)
    X = X[:n_samples]

    methods = [
        ("Sparse ARPACK", sparse_ipca_arpack),
        ("Dense ARPACK", dense_ipca_arpack),
        ("Dense Full", dense_ipca_full),
    ]
    metrics = {}
    models = {}
    coords = {}

    print()
    print(f"\033[1mBenchmarking on {n_samples} samples...\033[0m")
    for name, func in methods:
        print(f"Running {name}...", end=" ", flush=True)
        stats, output = measure_performance(func, X)
        model, coord = output
        metrics[name] = stats
        models[name] = model
        coords[name] = coord
        print(f"Time = {stats['time_s']:.3f}s, Peak Memory = {stats['peak_mb']:.2f}MB")

    print()
    print("\033[1mVerifying results...\033[0m")
    base = "Dense Full"
    base_model = models[base]
    for name, _ in methods:
        if name == base:
            continue
        model = models[name]
        assert np.allclose(base_model.components_, model.components_)
        assert np.allclose(base_model.explained_variance_, model.explained_variance_)
        assert np.allclose(base_model.singular_values_, model.singular_values_)
        print(f"- {base} vs {name}: OK")
    print("All results are equivalent! ✅")

    print()
    print("\033[1mSummarizing performance and memory usage...\033[0m")
    base_stats = metrics[base]
    for name in methods:
        key = name[0]
        if key == base:
            metrics[key]["speedup"] = 1.0
            metrics[key]["memory_saving"] = 1.0
        else:
            t = metrics[key]["time_s"]
            m = metrics[key]["peak_mb"]
            metrics[key]["speedup"] = base_stats["time_s"] / t
            metrics[key]["memory_saving"] = base_stats["peak_mb"] / m

    df = pd.DataFrame(metrics).T
    df = df[["time_s", "peak_mb", "speedup", "memory_saving"]]
    print(df.round(3))


if __name__ == "__main__":
    main()
```

```
Benchmarking on 3000 samples...
Running Sparse ARPACK... Time = 1.716s, Peak Memory = 3005.11MB
Running Dense ARPACK... Time = 18.594s, Peak Memory = 9320.87MB
Running Dense Full... Time = 122.849s, Peak Memory = 14960.27MB

Verifying results...
- Dense Full vs Sparse ARPACK: OK
- Dense Full vs Dense ARPACK: OK
All results are equivalent! ✅

Summarizing performance and memory usage...
                time_s    peak_mb  speedup  memory_saving
Sparse ARPACK    1.716   3005.110   71.586          4.978
Dense ARPACK    18.594   9320.873    6.607          1.605
Dense Full     122.849  14960.265    1.000          1.000
```

Additional Comments & Questions

About the new svd_solver parameter: This is added because I found no other way to support sparse input without densifying, and I think it is reasonable to add. "full" (the default) is the original behavior, where sparse data is densified in batches. "arpack" is the truncated SVD version that does not densify sparse data. I did not add an "auto" option because ideally it should select "arpack" for sparse data, which is not the default behavior. Perhaps we can still have an "auto" option, but not as the default, and make it the default some day?
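
For reference, the intended usage of the new parameter is simply the following (a minimal sketch mirroring the benchmark above; fetch_20newsgroups_vectorized returns a sparse CSR matrix):

```python
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.decomposition import IncrementalPCA

X, _ = fetch_20newsgroups_vectorized(return_X_y=True)  # sparse CSR input

# svd_solver="arpack" (new in this PR) keeps X sparse throughout;
# svd_solver="full" (the default) densifies each batch as before.
ipca = IncrementalPCA(n_components=20, svd_solver="arpack")
X_reduced = ipca.fit_transform(X)
```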

About sparse support: Previously the fit method accepted CSR, CSC, and LIL formats. This PR no longer supports the LIL format, because the sparse version of _incremental_mean_and_var only supports CSR and CSC formats. We could convert LIL to CSR/CSC to keep supporting that format, but is this necessary? Maybe we can just add a note somewhere in the changelog, because it is very easy for users to do the conversion themselves (see the one-liner below).
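
For illustration, the conversion users would need is a one-liner (a minimal sketch with a hypothetical LIL matrix):

```python
import scipy.sparse as sp

X_lil = sp.lil_matrix((1000, 50))  # hypothetical LIL-format input
X_csr = X_lil.tocsr()              # convert to CSR (or .tocsc()) before fitting
```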

About testing: I currently extended most tests to both svd_solvers on dense data; do I need to extend them over dense and sparse containers as well? Currently the only test that uses sparse data plus the ARPACK solver is test_incremental_pca_sparse, which performs some basic validation as before. Is this enough?
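
One possible way to extend coverage over both solvers and sparse containers, sketched here purely for discussion (the test name and data are illustrative, not the actual test suite):

```python
import numpy as np
import pytest
from scipy import sparse

from sklearn.decomposition import IncrementalPCA


@pytest.mark.parametrize("svd_solver", ["full", "arpack"])
@pytest.mark.parametrize("sparse_container", [sparse.csr_matrix, sparse.csc_matrix])
def test_incremental_pca_solvers_on_sparse(svd_solver, sparse_container):
    # Basic smoke test: both solvers accept CSR/CSC input and return the
    # expected transformed shape.
    rng = np.random.RandomState(0)
    X = sparse_container(rng.rand(60, 10))
    ipca = IncrementalPCA(n_components=3, batch_size=20, svd_solver=svd_solver)
    X_transformed = ipca.fit_transform(X)
    assert X_transformed.shape == (60, 3)
```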

@github-actions (GitHub Actions) bot commented Jul 17, 2024 · edited

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 294f615. Link to the linter CI: here

```diff
@@ -86,7 +86,7 @@ def get_precision(self):
         exp_var_diff = xp.where(
             exp_var > self.noise_variance_,
             exp_var_diff,
-            xp.asarray(0.0, device=device(exp_var)),
+            xp.asarray(1e-10, device=device(exp_var)),
```
@Charlie-XIAO (Contributor, Author) commented:
0.0 becomes nan when computing 1.0 / exp_var_diff later, which linalg.inv then cannot handle. I wonder if giving it a small value instead is reasonable; otherwise, perhaps exp_var should theoretically always be greater than self.noise_variance_, which in turn would mean that my implementation is incorrect somewhere?

Note: test_incremental_pca_sparse with n_components = X.shape[1] - 1 triggers the issue.
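
A minimal illustration of the failure mode in plain NumPy (not the array-API code path used in get_precision), showing why flooring at 1e-10 keeps the reciprocal finite:

```python
import numpy as np

exp_var_diff = np.array([2.5, 0.0])  # an exact zero from the old xp.where branch
with np.errstate(divide="ignore"):
    print(1.0 / exp_var_diff)        # [0.4 inf] -- non-finite values break linalg.inv downstream

exp_var_diff_floored = np.array([2.5, 1e-10])  # with the 1e-10 floor from this diff
print(1.0 / exp_var_diff_floored)    # [4.e-01 1.e+10] -- finite
```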

@Charlie-XIAO changed the title from "ENH support partial fitting incremental PCA on sparse data" to "ENH add ARPACK solver to IncrementalPCA to avoid densifying sparse data" on Jul 19, 2024
@Charlie-XIAO marked this pull request as ready for review July 19, 2024 13:04
Successfully merging this pull request may close these issues: Proper sparse support in IncrementalPCA