
We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.

Closed 5 years ago.

The community reviewed whether to reopen this question 1 year ago and left it closed: the original close reason(s) were not resolved.

I'm building some predictive models in Python and have been using scikit-learn's SVM implementation. It's been really great, easy to use, and relatively fast.

Unfortunately, I'm beginning to become constrained by my runtime. I run an RBF SVM on a full dataset of about 4,000-5,000 samples with 650 features. Each run takes about a minute. But with 5-fold cross-validation plus a grid search (using a coarse-to-fine search), it's getting a bit infeasible for my task at hand. So, generally: do people have any recommendations for the fastest SVM implementation usable in Python? That, or any ways to speed up my modeling?

I've heard of LIBSVM's GPU implementation, which seems like it could work. I don't know of any other GPU SVM implementations usable in Python, but I would definitely be open to others. Also, does using the GPU significantly reduce runtime?

I've also heard that there are ways of approximating the RBF SVM by using a linear SVM plus a feature map in scikit-learn. I'm not sure what people think about this approach. Again, for anyone using it: does it give a significant runtime improvement?
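For reference, the linear-SVM-plus-feature-map idea can be sketched with scikit-learn's RBFSampler (random Fourier features) feeding a linear classifier; the dataset, gamma, and component count below are illustrative, not from the question:

```python
# Approximate an RBF SVM: map inputs into a randomized feature space that
# approximates the RBF kernel, then fit a fast *linear* model on top.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

approx_rbf = make_pipeline(
    RBFSampler(gamma=0.1, n_components=300, random_state=0),  # feature map
    SGDClassifier(loss="hinge", random_state=0),              # linear SVM loss
)
approx_rbf.fit(X, y)
print(approx_rbf.score(X, y))
```

Training cost then scales with the number of samples times the number of random components, rather than quadratically in the number of samples as with an exact kernel SVM.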

Any ideas for increasing the speed of the program are most welcome.

asked Feb 15, 2012 at 18:46 by tomas

10 Answers


The most scalable kernel SVM implementation I know of is LaSVM. It's written in C, hence wrappable in Python if you know Cython, ctypes, or cffi. Alternatively, you can use it from the command line. You can use the utilities in sklearn.datasets to load/convert data from NumPy or CSR format into svmlight-formatted files that LaSVM can use as training/test sets.
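A minimal sketch of that export step, using scikit-learn's dump_svmlight_file (the file name and the tiny random arrays are just for illustration):

```python
# Convert a NumPy array to the svmlight/libsvm text format that tools such
# as LaSVM consume, then round-trip it to verify the file is readable.
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.random.RandomState(0).rand(10, 5)
y = np.random.RandomState(1).randint(0, 2, size=10)

path = os.path.join(tempfile.gettempdir(), "train.svmlight")
dump_svmlight_file(X, y, path)

# load_svmlight_file returns a scipy sparse matrix plus the label vector.
X2, y2 = load_svmlight_file(path)
print(X2.shape)
```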

answered Feb 15, 2012 at 20:33 by ogrisel

7 Comments

Thanks ogrisel. I'll take a look at this. Definitely looks interesting. Sklearn can export into svmlight format? That will definitely be useful. In response to your prior answer, unfortunately, I'm dealing with time series, so random sampling + splitting into train/test becomes quite a bit more complicated. Not sure subsampling to train my model will be all that straightforward. Thanks!
Sorry, quick addendum ogrisel: do you know which utility function in sklearn can export in svmlight format?
@tomas If your samples are not (loosely) i.i.d., there is a good chance that an SVM with a generic kernel such as RBF will not yield good results. If you have time-series data (with time dependencies between consecutive measurements), you should either extract higher-level features (e.g. convolutions over sliding windows, or STFT) or precompute a kernel dedicated to time series.
Hmm... interesting. Do you mind expanding on what you said? I've heard of dependent data causing issues for cross-validation procedures, but not specifically for an RBF SVM. What issues can arise? And any references or pointers on what is meant by extracting higher-level features? Don't know if the comment section is the best place, but would love to hear more about this. Thanks.
If the inter-sample time dependencies prevent you from doing arbitrary subsampling & cross-validation, I don't see how the RBF SVM model will be able to learn something general: the model makes its predictions for each individual sample one at a time, independently of past predictions (no memory), hence the input features should encode some kind of high-level "context" if you want it to generalize enough to make interesting predictions on previously unseen data.

Alternatively you can run the grid search on 1000 random samples instead of the full dataset:

>>> from sklearn.cross_validation import ShuffleSplit
>>> cv = ShuffleSplit(3, test_fraction=0.2, train_fraction=0.2, random_state=0)
>>> gs = GridSearchCV(clf, params_grid, cv=cv, n_jobs=-1, verbose=2)
>>> gs.fit(X, y)

It's very likely that the optimal parameters for 5000 samples will be very close to the optimal parameters for 1000 samples. So that's a good way to start your coarse grid search.

n_jobs=-1 makes it possible to use all your CPUs to run the individual CV fits in parallel. It's using multiprocessing, so the Python GIL is not an issue.
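Note that sklearn.cross_validation was later removed from scikit-learn; a roughly equivalent sketch with the current sklearn.model_selection API (the estimator, parameter grid, and synthetic dataset here are illustrative):

```python
# Subsampled grid search with the modern scikit-learn API: each CV split
# trains on a random 20% of the data and tests on a disjoint 20%.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

cv = ShuffleSplit(n_splits=3, train_size=0.2, test_size=0.2, random_state=0)
gs = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=cv, n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_)
```

The coarse search can then be refined around gs.best_params_ on progressively larger subsamples.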

Comments


Firstly, according to scikit-learn's benchmark (here), scikit-learn is already one of the fastest, if not the fastest, SVM packages around. Hence, you might want to consider other ways of speeding up the training.

As suggested by bavaza, you can try to multi-thread the training process. If you are using scikit-learn's GridSearchCV class, you can easily set the n_jobs argument to be larger than the default value of 1 to perform the training in parallel, at the expense of using more memory. You can find the documentation here, and an example of how to use the class here.

Alternatively, you can take a look at the Shogun Machine Learning Library here.

Shogun is designed for large-scale machine learning, with wrappers to many common SVM packages, and it is implemented in C/C++ with bindings for Python. According to scikit-learn's benchmark above, its speed is comparable to scikit-learn's. On other tasks (other than the one they demonstrated), it might be faster, so it is worth a try.

Lastly, you can try to perform dimensionality reduction, e.g. using PCA or randomized PCA to reduce the dimension of your feature vectors. That would speed up the training process. The documentation for the respective classes can be found in these two links: PCA, Randomized PCA. You can find examples of how to use them in scikit-learn's examples section.
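A minimal sketch of that last idea (the component count and synthetic data are illustrative; in current scikit-learn the old RandomizedPCA class corresponds to PCA with svd_solver="randomized"):

```python
# Shrink the feature space with randomized PCA before fitting the RBF SVM,
# so the SVM trains on far fewer dimensions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, random_state=0)

model = make_pipeline(
    PCA(n_components=30, svd_solver="randomized", random_state=0),
    SVC(kernel="rbf"),
)
model.fit(X, y)
print(model.score(X, y))
```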

answered Sep 27, 2012 at 3:30 by lightalchemist

Comments


If you are interested in only using the RBF kernel (or any other quadratic kernel for that matter), then I suggest using LIBSVM on MATLAB or Octave. I train a model of 7,000 observations and 500 features in about 6 seconds.

The trick is to use the precomputed kernels that LIBSVM provides, and use some matrix algebra to compute the kernel in one step instead of looping over the data twice. The kernel takes about two seconds to build, as opposed to a lot more using LIBSVM's own RBF kernel. I presume you would be able to do so in Python using NumPy, but I am not sure, as I have not tried it.
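The NumPy version of that one-step computation is indeed straightforward; a sketch with made-up array sizes and gamma, using the expansion ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z:

```python
# Build the full RBF Gram matrix in one vectorized step instead of a
# double loop over sample pairs.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 50)
gamma = 0.01

sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
K = np.exp(-gamma * sq_dists)  # 200 x 200 precomputed RBF kernel
print(K.shape)
```

A matrix like K is what LIBSVM-style tools expect when you select their precomputed-kernel mode.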

answered Oct 25, 2012 at 16:03 by charlieBrown

1 Comment

Generally speaking, LIBSVM is a good, mature library, but I think it's not the fastest, and 7,000 x 500 is a very small problem to test.

Without going too much into comparing SVM libraries, I think the task you are describing (cross-validation) can benefit from real multi-threading (i.e. running on several CPUs in parallel). If you are using CPython, it does not take advantage of your (probably) multi-core machine, due to the GIL.

You can try other implementations of Python which don't have this limitation. See PyPy, or IronPython if you are willing to go to .NET.

answered Feb 15, 2012 at 18:58 by bavaza

4 Comments

Thanks bavaza, I'll take a look into it. Assuming I do take advantage of my multi-core computer, any other suggestions on speeding up my program? I was going to figure out a way to cross-validate across multiple threads anyway. However, I think I still need a speed-up.
@bavaza, I have been running Python on multiple cores for many years; it works very well. Please research the multiprocessing lib of standard CPython.
@V3ss0n, thanks. Looks like a nice lib. As it uses processes and not threads, are you familiar with any context-switching penalties (e.g. when using a large worker pool)?
PyPy also has a GIL (even though they have an experimental project to implement an alternative memory-management strategy). As some have said, to avoid the GIL the easiest way to go is still multiprocessing instead of threading. I'm really not sure using IronPython will give better performance (with all the .NET overhead).

Try svm_light!

It is a wicked-fast C implementation from the infamous Thorsten Joachims at Cornell, with good Python bindings, and you can install it with pip install pysvmlight.

answered Apr 29, 2013 at 18:58 by Michael Matthew Toomim

Comments


I think you can try ThunderSVM, which utilizes GPUs.

answered Jul 22, 2020 at 10:05 by ryh

Comments


If your problem has two classes, this wrapping of a CUDA-based SVM with scikit-learn is useful:

https://github.com/niitsuma/gpusvm/tree/master/python

answered Apr 28, 2015 at 12:20 by niitsuma

Comments


I'd consider using a random forest to reduce the number of features you input.

There is an option with the ExtraTreesRegressor and ExtraTreesClassifier to generate feature importances. You can then use this information to input a subset of features into your SVM.
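A small sketch of that workflow (the feature counts and the choice of keeping the top 20 features are illustrative):

```python
# Rank features with an extra-trees ensemble, then train the SVM only on
# the most important ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:20]  # 20 best features

svm = SVC(kernel="rbf").fit(X[:, top], y)
print(svm.score(X[:, top], y))
```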

answered May 31, 2013 at 4:52 by denson

Comments


I suggest looking at scikit-learn's Stochastic Gradient Descent implementation. The default hinge loss gives a linear SVM. I've found it to be blazingly fast.
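A minimal sketch of that approach on a synthetic dataset (hyperparameters are illustrative):

```python
# SGDClassifier with the default hinge loss trains a linear SVM by
# stochastic gradient descent, one sample (or mini-batch) at a time.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Because training is incremental, it scales roughly linearly in the number of samples, which is why it feels so fast compared with a kernel SVM.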

answered Nov 12, 2014 at 23:32 by szxk

Comments
