What is preventing sklearn from achieving true model persistence? #30609

Pierre-Bartet started this conversation in General

What is preventing sklearn from achieving true model persistence?
For example model.dump(...) + LogisticRegression.load(...)?
All the existing solutions are brittle or force users to use exactly the same sklearn version for training and inference:
https://scikit-learn.org/1.6/model_persistence.html
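For reference, a minimal sketch of the joblib flow from those docs; it is this full-object pickling that ties inference to the training-time sklearn version (the file name is illustrative):

```python
# Current recommended flow: pickle/joblib the whole estimator object.
# The file embeds private, version-specific internals, so it must be loaded
# under a compatible scikit-learn version.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

joblib.dump(model, "model.joblib")      # serializes the full Python object graph
restored = joblib.load("model.joblib")  # brittle across sklearn versions
print(restored.predict(X[:3]))
```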

I understand that this is a deliberate choice because of the sklearn team's lack of resources, but offloading serialization logic to external libraries can only end up in a much worse maintenance, communication, and interdependence nightmare.

For example sklearn-onnx accesses private sklearn components to be able to serialize them (such as PolynomialFeatures's _min_degree, or gradient boosting's _predictors).
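To illustrate the coupling this creates, here is a sketch of converter-style access to one of those private attributes; the attribute name is the one mentioned above, and whether it exists at all depends on the sklearn version, which is exactly the problem:

```python
# Illustration only: an external converter reading private sklearn state.
# `_min_degree` is internal, so a "non-breaking" sklearn release may rename or
# drop it; the converter has to guess defensively and can still silently break.
from sklearn.preprocessing import PolynomialFeatures

pf = PolynomialFeatures(degree=(2, 3)).fit([[0.0, 1.0]])

min_degree = getattr(pf, "_min_degree", None)  # private: no stability guarantee
if min_degree is None:
    raise RuntimeError("sklearn internals changed; the converter needs an update")
print(min_degree)  # 2 on versions where the attribute still exists
```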

Covering all of sklearn's components would be a tremendous task, but it could be done step by step, and it is also somewhat parallelizable by assigning a few models to anyone who would be happy to help.


Replies: 2 comments 2 replies


Basically, it is more of a maintenance burden: with the team, we estimate that we could not maintain it. However, we had a recent discussion in which we think that we could have a trimmed inference estimator for each estimator, reducing the impact of potential private changes when updating scikit-learn versions in this setting. Basically, it would make life easier for packages such as sklearn-onnx.

It would be possible to work on persistence with a fit + inference split, but the maintenance is really the bottleneck.
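To make the idea concrete, here is a purely hypothetical sketch of what a trimmed inference-only counterpart could look like; the class name and the exact attribute set are made up for illustration:

```python
# Hypothetical "trimmed inference estimator": it carries only the public fitted
# state needed for prediction, as plain numpy arrays, so persisting it does not
# depend on sklearn's private internals.
import numpy as np


class LogisticRegressionInference:
    """Illustrative inference-only counterpart to a fitted LogisticRegression."""

    def __init__(self, coef, intercept, classes):
        self.coef_ = np.asarray(coef)
        self.intercept_ = np.asarray(intercept)
        self.classes_ = np.asarray(classes)

    @classmethod
    def from_fitted(cls, fitted_lr):
        # Reads only public, documented fitted attributes.
        return cls(fitted_lr.coef_, fitted_lr.intercept_, fitted_lr.classes_)

    def predict(self, X):
        scores = np.asarray(X) @ self.coef_.T + self.intercept_
        if scores.shape[1] == 1:              # binary: one decision column
            indices = (scores.ravel() > 0).astype(int)
        else:                                 # multiclass: highest score wins
            indices = scores.argmax(axis=1)
        return self.classes_[indices]
```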

2 replies
@Pierre-Bartet

Thanks, I understand the maintenance burden issue, but right now a sklearn non-breaking change (such as removing one of the above private attributes) can break sklearn-onnx or any external attempt at serializing models, which seems to be an even larger maintenance burden for everyone (sklearn team included).

Your trimmed inference estimator idea is awesome!

@Pierre-Bartet

Another path would be to "just" make sure that everything necessary for inference (but nothing more) is accessible as public attributes (without creating a new class for each estimator), so that tools such as sklearn-onnx can rely on something stable (and maybe also help them reach a point where sklearn-onnx is less buggy and has more coverage, since it is still a huge task).
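A rough sketch of what that could look like with today's public surface, using LogisticRegression as an example; it relies only on documented fitted attributes, but it is not an officially supported persistence path:

```python
# Rough sketch: rebuild a predict-capable LogisticRegression from public
# attributes only (get_params() for hyperparameters, coef_/intercept_/classes_
# for fitted state). Not an officially supported path today.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
trained = LogisticRegression(max_iter=1000).fit(X, y)

# Everything needed for inference, as numpy arrays / plain Python objects.
state = {
    "params": trained.get_params(),
    "coef_": trained.coef_,
    "intercept_": trained.intercept_,
    "classes_": trained.classes_,
    "n_features_in_": trained.n_features_in_,
}

# Rebuild on the inference side without unpickling the original object.
restored = LogisticRegression(**state["params"])
for name in ("coef_", "intercept_", "classes_", "n_features_in_"):
    setattr(restored, name, state[name])

assert np.array_equal(restored.predict(X), trained.predict(X))
```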


I concur with you Pierre-Bartet, it should be feasible to implement model persistence as a community effort. Issue #31143 is relevant for this discussion. There is no need to decide on a persistence format; the only requirement is that parameters/state can be retrieved from a model as either numpy or Python native data structures, and conversely, that a model can consume the same as input for initialisation.
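A minimal sketch of that requirement, assuming the trailing-underscore convention is enough to find the fitted state (it is a heuristic, not a supported API):

```python
# Dump an estimator's hyperparameters and fitted state to JSON-friendly
# Python structures, then restore them. Attribute discovery relies on the
# trailing-underscore naming convention, which is a heuristic, not an API.
import json
import numpy as np
from sklearn.linear_model import Ridge


def _to_native(value):
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, np.generic):      # numpy scalar -> plain Python scalar
        return value.item()
    return value


def to_state(est):
    fitted = {
        name: _to_native(value)
        for name, value in vars(est).items()
        if name.endswith("_") and not name.startswith("_")
    }
    return {"params": est.get_params(), "fitted": fitted}


def from_state(cls, state):
    est = cls(**state["params"])
    for name, value in state["fitted"].items():
        setattr(est, name, np.asarray(value) if isinstance(value, list) else value)
    return est


model = Ridge(alpha=0.5).fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
payload = json.dumps(to_state(model))              # plain text, no pickle
restored = from_state(Ridge, json.loads(payload))
print(restored.predict([[3.0]]))
```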

0 replies
Category
General
Labels
None yet
3 participants
@Pierre-Bartet @glemaitre @jcbsv
