
We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.

Closed 5 years ago.

The community reviewed whether to reopen this question 1 year ago and left it closed: the original close reason(s) were not resolved.

I'm building some predictive models in Python and have been using scikit-learn's SVM implementation. It's been really great, easy to use, and relatively fast.

Unfortunately, I'm beginning to become constrained by my runtime. I run an RBF SVM on a full dataset of about 4,000-5,000 samples with 650 features. Each run takes about a minute. But with 5-fold cross-validation plus a grid search (using a coarse-to-fine search), it's getting a bit infeasible for my task at hand. So, generally: do people have any recommendations for the fastest SVM implementation usable in Python? That, or any ways to speed up my modeling?

I've heard of LIBSVM's GPU implementation, which seems like it could work. I don't know of any other GPU SVM implementations usable in Python, but I would definitely be open to others. Also, does using the GPU significantly reduce runtime?

I've also heard that there are ways of approximating the RBF SVM by using a linear SVM plus a feature map in scikit-learn. I'm not sure what people think about this approach. Again, for anyone using it: does it give a significant runtime improvement?
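For reference, the linear-SVM-plus-feature-map idea can be sketched with scikit-learn's RBFSampler (random Fourier features) feeding a linear classifier; the dataset, gamma, and component count below are illustrative, not from the question:

```python
# Approximate an RBF SVM: map inputs into a randomized feature space that
# approximates the RBF kernel, then fit a fast *linear* model on top.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

approx_rbf = make_pipeline(
    RBFSampler(gamma=0.1, n_components=300, random_state=0),  # feature map
    SGDClassifier(loss="hinge", random_state=0),              # linear SVM loss
)
approx_rbf.fit(X, y)
print(approx_rbf.score(X, y))
```

Training cost then scales with the number of samples times the number of random components, rather than quadratically in the number of samples as with an exact kernel SVM.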

Any ideas for increasing the speed of the program are most welcome.

asked Feb 15, 2012 at 18:46 by tomas

10 Answers


The most scalable kernel SVM implementation I know of is LaSVM. It's written in C, hence wrappable in Python if you know Cython, ctypes, or cffi. Alternatively, you can use it from the command line. You can use the utilities in sklearn.datasets to load/convert data from NumPy or CSR format into svmlight-formatted files that LaSVM can use as training/test sets.
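A minimal sketch of that export step, using scikit-learn's dump_svmlight_file (the file name and the tiny random arrays are just for illustration):

```python
# Convert a NumPy array to the svmlight/libsvm text format that tools such
# as LaSVM consume, then round-trip it to verify the file is readable.
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.random.RandomState(0).rand(10, 5)
y = np.random.RandomState(1).randint(0, 2, size=10)

path = os.path.join(tempfile.gettempdir(), "train.svmlight")
dump_svmlight_file(X, y, path)

# load_svmlight_file returns a scipy sparse matrix plus the label vector.
X2, y2 = load_svmlight_file(path)
print(X2.shape)
```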

answered Feb 15, 2012 at 20:33 by ogrisel

7 Comments

Thanks ogrisel. I'll take a look at this. Definitely looks interesting. Sklearn can export into svmlight format? That will definitely be useful. In response to your prior answer, unfortunately, I'm dealing with time series, so random sampling + splitting into train/test becomes quite a bit more complicated. Not sure subsampling to train my model will be all that straightforward. Thanks!
Sorry, quick addendum ogrisel: do you know which utility function in sklearn can export in svmlight format?
@tomas If your samples are not (loosely) i.i.d., there is a good chance that an SVM with a generic kernel such as RBF will not yield good results. If you have time-series data (with time dependencies between consecutive measurements), you should either extract higher-level features (e.g. convolutions over sliding windows, or STFT) or precompute a kernel dedicated to time series.
Hmm... interesting. Do you mind expanding on what you said? I've heard of dependent data causing issues for cross-validation procedures, but not specifically for an RBF SVM. What issues can arise? And any references or pointers on what is meant by extracting higher-level features? Don't know if the comment section is the best place, but would love to hear more about this. Thanks.
If the inter-sample time dependencies prevent you from doing arbitrary subsampling & cross-validation, I don't see how the RBF SVM model will be able to learn something general: the model makes its predictions for each individual sample one at a time, independently of past predictions (no memory), hence the input features should encode some kind of high-level "context" if you want it to generalize enough to make interesting predictions on previously unseen data.

Alternatively you can run the grid search on 1000 random samples instead of the full dataset:

>>> from sklearn.cross_validation import ShuffleSplit
>>> cv = ShuffleSplit(3, test_fraction=0.2, train_fraction=0.2, random_state=0)
>>> gs = GridSearchCV(clf, params_grid, cv=cv, n_jobs=-1, verbose=2)
>>> gs.fit(X, y)

It's very likely that the optimal parameters for 5000 samples will be very close to the optimal parameters for 1000 samples. So that's a good way to start your coarse grid search.

n_jobs=-1 makes it possible to use all your CPUs to run the individual CV fits in parallel. It's using multiprocessing, so the Python GIL is not an issue.
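Note that sklearn.cross_validation was later removed from scikit-learn; a roughly equivalent sketch with the current sklearn.model_selection API (the estimator, parameter grid, and synthetic dataset here are illustrative):

```python
# Subsampled grid search with the modern scikit-learn API: each CV split
# trains on a random 20% of the data and tests on a disjoint 20%.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

cv = ShuffleSplit(n_splits=3, train_size=0.2, test_size=0.2, random_state=0)
gs = GridSearchCV(SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, cv=cv, n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_)
```

The coarse search can then be refined around gs.best_params_ on progressively larger subsamples.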

Comments


Firstly, according to scikit-learn's benchmark (here), scikit-learn is already one of the fastest, if not the fastest, SVM packages around. Hence, you might want to consider other ways of speeding up the training.

As suggested by bavaza, you can try to multi-thread the training process. If you are using scikit-learn's GridSearchCV class, you can easily set the n_jobs argument to be larger than the default value of 1 to perform the training in parallel, at the expense of using more memory. You can find the documentation here, and an example of how to use the class here.

Alternatively, you can take a look at the Shogun Machine Learning Library here.

Shogun is designed for large-scale machine learning, with wrappers to many common SVM packages, and it is implemented in C/C++ with bindings for Python. According to scikit-learn's benchmark above, its speed is comparable to scikit-learn's. On other tasks (other than the one they demonstrated), it might be faster, so it is worth a try.

Lastly, you can try to perform dimensionality reduction, e.g. using PCA or randomized PCA to reduce the dimension of your feature vectors. That would speed up the training process. The documentation for the respective classes can be found in these two links: PCA, Randomized PCA. You can find examples of how to use them in scikit-learn's examples section.
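A minimal sketch of that last idea (the component count and synthetic data are illustrative; in current scikit-learn the old RandomizedPCA class corresponds to PCA with svd_solver="randomized"):

```python
# Shrink the feature space with randomized PCA before fitting the RBF SVM,
# so the SVM trains on far fewer dimensions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, random_state=0)

model = make_pipeline(
    PCA(n_components=30, svd_solver="randomized", random_state=0),
    SVC(kernel="rbf"),
)
model.fit(X, y)
print(model.score(X, y))
```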

answered Sep 27, 2012 at 3:30 by lightalchemist

Comments


If you are interested in only using the RBF kernel (or any other quadratic kernel for that matter), then I suggest using LIBSVM on MATLAB or Octave. I train a model of 7,000 observations and 500 features in about 6 seconds.

The trick is to use the precomputed kernels that LIBSVM provides, and use some matrix algebra to compute the kernel in one step instead of looping over the data twice. The kernel takes about two seconds to build, as opposed to a lot more using LIBSVM's own RBF kernel. I presume you would be able to do so in Python using NumPy, but I am not sure, as I have not tried it.
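The NumPy version of that one-step computation is indeed straightforward; a sketch with made-up array sizes and gamma, using the expansion ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x.z:

```python
# Build the full RBF Gram matrix in one vectorized step instead of a
# double loop over sample pairs.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 50)
gamma = 0.01

sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
K = np.exp(-gamma * sq_dists)  # 200 x 200 precomputed RBF kernel
print(K.shape)
```

A matrix like K is what LIBSVM-style tools expect when you select their precomputed-kernel mode.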

answered Oct 25, 2012 at 16:03 by charlieBrown

1 Comment

Generally speaking, LIBSVM is a good, mature library, but I think it's not the fastest, and 7,000 x 500 is a very small problem to test.

Without going too much into comparing SVM libraries, I think the task you are describing (cross-validation) can benefit from real multi-threading (i.e. running on several CPUs in parallel). If you are using CPython, it does not take advantage of your (probably) multi-core machine, due to the GIL.

You can try other implementations of Python which don't have this limitation. See PyPy, or IronPython if you are willing to go to .NET.

answered Feb 15, 2012 at 18:58 by bavaza

4 Comments

Thanks bavaza, I'll take a look into it. Assuming I do take advantage of my multi-core computer, any other suggestions on speeding up my program? I was going to figure out a way to cross-validate across multiple threads anyway. However, I think I still need a speed-up.
@bavaza, I have been running Python on multiple cores for many years; it works very well. Please research the multiprocessing lib of standard CPython.
@V3ss0n, thanks. Looks like a nice lib. As it uses processes and not threads, are you familiar with any context-switching penalties (e.g. when using a large worker pool)?
PyPy also has a GIL (even though they have an experimental project to implement an alternative memory-management strategy). As some have said, to avoid the GIL the easiest way to go is still multiprocessing instead of threading. I'm really not sure using IronPython will give better performance (with all the .NET overhead).

Try svm_light!

It is a wicked-fast C implementation from the infamous Thorsten Joachims at Cornell, with good Python bindings, and you can install it with pip install pysvmlight.

answered Apr 29, 2013 at 18:58 by Michael Matthew Toomim

Comments


I think you can try ThunderSVM, which utilizes GPUs.

answered Jul 22, 2020 at 10:05 by ryh

Comments


If your problem has two classes, this wrapping of a CUDA-based SVM with scikit-learn is useful:

https://github.com/niitsuma/gpusvm/tree/master/python

answered Apr 28, 2015 at 12:20 by niitsuma

Comments


I'd consider using a random forest to reduce the number of features you input.

There is an option with the ExtraTreesRegressor and ExtraTreesClassifier to generate feature importances. You can then use this information to input a subset of features into your SVM.
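A small sketch of that workflow (the feature counts and the choice of keeping the top 20 features are illustrative):

```python
# Rank features with an extra-trees ensemble, then train the SVM only on
# the most important ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:20]  # 20 best features

svm = SVC(kernel="rbf").fit(X[:, top], y)
print(svm.score(X[:, top], y))
```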

answered May 31, 2013 at 4:52 by denson

Comments


I suggest looking at scikit-learn's Stochastic Gradient Descent implementation. The default hinge loss gives a linear SVM. I've found it to be blazingly fast.
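A minimal sketch of that approach on a synthetic dataset (hyperparameters are illustrative):

```python
# SGDClassifier with the default hinge loss trains a linear SVM by
# stochastic gradient descent, one sample (or mini-batch) at a time.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Because training is incremental, it scales roughly linearly in the number of samples, which is why it feels so fast compared with a kernel SVM.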

answered Nov 12, 2014 at 23:32 by szxk

Comments
