I'm building some predictive models in Python and have been using scikit-learn's SVM implementation. It's been really great, easy to use, and relatively fast.
Unfortunately, I'm beginning to be constrained by my runtime. I run an RBF SVM on a full dataset of about 4000-5000 samples with 650 features. Each run takes about a minute. But with 5-fold cross-validation plus grid search (using a coarse-to-fine search), it's getting a bit infeasible for my task at hand. So, generally: do people have any recommendations for the fastest SVM implementation that can be used in Python? That, or any ways to speed up my modeling?
I've heard of LIBSVM's GPU implementation, which seems like it could work. I don't know of any other GPU SVM implementations usable in Python, but I would definitely be open to others. Also, does using the GPU significantly improve runtime?
I've also heard that there are ways of approximating the RBF SVM by using a linear SVM plus a feature map in scikit-learn. I'm not sure what people think about this approach. Again, for anyone using this approach: does it give a significant runtime improvement?
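For reference, the "linear SVM + feature map" approximation can be sketched with scikit-learn's RBFSampler (random Fourier features). This is only a sketch: the synthetic dataset, gamma, and component count below are illustrative assumptions, not tuned values.

```python
# Approximate an RBF SVM with an explicit random feature map + linear SVM.
# Dataset, gamma, and n_components here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=100, random_state=0)

# Map the inputs into an approximate RBF feature space...
feature_map = RBFSampler(gamma=0.1, n_components=300, random_state=0)
X_mapped = feature_map.fit_transform(X)

# ...then train a plain (fast) linear SVM on the mapped features.
clf = LinearSVC(max_iter=5000)
clf.fit(X_mapped, y)
print(X_mapped.shape)
```

Training cost then scales with the number of random components rather than requiring the full kernel machinery, which is where the speedup comes from.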
All ideas for increasing the speed of the program are most welcome.
10 Answers
The most scalable kernel SVM implementation I know of is LaSVM. It's written in C, hence wrappable in Python if you know Cython, ctypes, or cffi. Alternatively, you can use it from the command line. You can use the utilities in sklearn.datasets to load / convert data from NumPy or CSR format into svmlight-formatted files that LaSVM can use as training / test set.
Alternatively you can run the grid search on 1000 random samples instead of the full dataset:
>>> from sklearn.cross_validation import ShuffleSplit
>>> cv = ShuffleSplit(3, test_fraction=0.2, train_fraction=0.2, random_state=0)
>>> gs = GridSearchCV(clf, params_grid, cv=cv, n_jobs=-1, verbose=2)
>>> gs.fit(X, y)

It's very likely that the optimal parameters for 5000 samples will be very close to the optimal parameters for 1000 samples, so that's a good way to start your coarse grid search.
n_jobs=-1 makes it possible to use all your CPUs to run the individual CV fits in parallel. It uses multiprocessing, so the Python GIL is not an issue.
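The snippet above uses the long-removed sklearn.cross_validation module; in current scikit-learn the same subsampled search lives in sklearn.model_selection. A sketch of the equivalent (the dataset and parameter-grid values are illustrative assumptions):

```python
# Modern equivalent: grid-search on random ~20% subsets of the data.
# Dataset and grid values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)

# Each split trains and evaluates on a random 20% slice (~1000 samples),
# instead of the full dataset, which keeps each SVM fit cheap.
cv = ShuffleSplit(n_splits=3, train_size=0.2, test_size=0.2, random_state=0)
param_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}

gs = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=cv, n_jobs=-1)
gs.fit(X, y)
print(gs.best_params_)
```

Note the newer ShuffleSplit takes n_splits/train_size/test_size instead of the positional count and *_fraction arguments.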
Firstly, according to scikit-learn's benchmark (here), scikit-learn is already one of the fastest, if not the fastest, SVM packages around. Hence, you might want to consider other ways of speeding up the training.
As suggested by bavaza, you can try to multi-thread the training process. If you are using scikit-learn's GridSearchCV class, you can easily set the n_jobs argument to a value larger than the default of 1 to perform the training in parallel, at the expense of using more memory. You can find the documentation here, and an example of how to use the class here.
Alternatively, you can take a look at the Shogun Machine Learning Library here.
Shogun is designed for large-scale machine learning, with wrappers to many common SVM packages, and is implemented in C/C++ with bindings for Python. According to scikit-learn's benchmark above, its speed is comparable to scikit-learn's. On other tasks (other than the one they demonstrated) it might be faster, so it is worth a try.
Lastly, you can try to perform dimensionality reduction, e.g. using PCA or randomized PCA, to reduce the dimension of your feature vectors. That would speed up the training process. The documentation for the respective classes can be found in these two links: PCA, Randomized PCA. You can find examples of how to use them in scikit-learn's examples section.
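A sketch of that last suggestion: project the 650-dimensional features down to a smaller number of components before fitting the SVM. The 50-component target and synthetic data are assumptions for illustration; note that the old RandomizedPCA class has since been folded into PCA as svd_solver="randomized".

```python
# Reduce feature dimensionality with (randomized) PCA before training the SVM.
# n_components=50 and the synthetic dataset are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=650, random_state=0)

pca = PCA(n_components=50, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)  # shape goes from (1000, 650) to (1000, 50)

clf = SVC(kernel="rbf")
clf.fit(X_reduced, y)  # fitting on 50 features is much cheaper than on 650
print(X_reduced.shape)
```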
If you are interested in only using the RBF kernel (or any other quadratic kernel, for that matter), then I suggest using LIBSVM on MATLAB or Octave. I train a model of 7000 observations and 500 features in about 6 seconds.
The trick is to use the precomputed kernels that LIBSVM provides, and some matrix algebra to compute the kernel in one step instead of looping over the data twice. The kernel takes about two seconds to build, as opposed to much longer using LIBSVM's own RBF kernel. I presume you would be able to do the same in Python using NumPy, but I am not sure as I have not tried it.
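The NumPy version of that trick can be sketched as follows: build the full RBF Gram matrix in one round of vectorized matrix algebra (no Python loops) and hand it to SVC with kernel="precomputed". The random data and gamma value are illustrative assumptions.

```python
# Precompute the RBF kernel matrix with vectorized NumPy, then fit on it.
# Random data and gamma are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))
y = rng.integers(0, 2, size=500)
gamma = 0.01

# Squared Euclidean distances via ||a||^2 + ||b||^2 - 2*a.b, in one step.
sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
K = np.exp(-gamma * sq_dists)  # the full RBF Gram matrix

clf = SVC(kernel="precomputed")
clf.fit(K, y)  # rows of K play the role of the training samples
print(K.shape)
```

At prediction time you would need the kernel between the test points and the training points, built with the same formula.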
Without going too much into comparing SVM libraries, I think the task you are describing (cross-validation) can benefit from real multi-threading (i.e., running on several CPUs in parallel). If you are using CPython, it does not take advantage of your (probably) multi-core machine, due to the GIL.
You can try other implementations of Python which don't have this limitation. See PyPy, or IronPython if you are willing to go to .NET.
Try svm_light!
It is a wicked-fast C implementation from the infamous Thorsten Joachims at Cornell, with good Python bindings, and you can install it with pip install pysvmlight.
I think you can try ThunderSVM, which utilizes GPUs.
If your problem is a two-class one, this wrapping of a CUDA-based SVM with scikit-learn is useful:
I'd consider using a random forest to reduce the number of features you input.
There is an option with the ExtraTreesRegressor and ExtraTreesClassifier to generate feature importances. You can then use this information to input a subset of features into your SVM.
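A sketch of that workflow, under illustrative assumptions (synthetic data, keeping the top 50 of 650 features):

```python
# Rank features with an extremely randomized trees ensemble, then feed
# only the top-ranked subset to the SVM. All sizes here are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=650,
                           n_informative=20, random_state=0)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Keep the 50 features the forest ranks as most important.
top = np.argsort(forest.feature_importances_)[::-1][:50]
X_subset = X[:, top]

clf = SVC(kernel="rbf")
clf.fit(X_subset, y)  # the SVM now trains on 50 features instead of 650
print(X_subset.shape)
```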
I suggest looking at scikit-learn's Stochastic Gradient Descent implementation. The default hinge loss yields a linear SVM. I've found it to be blazingly fast.
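A minimal sketch of that suggestion (the synthetic dataset is an illustrative assumption; SGD is sensitive to feature scale, hence the scaler):

```python
# Linear SVM via SGD: loss="hinge" (the default) is the linear SVM objective.
# The synthetic dataset is an illustrative assumption.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=650, random_state=0)
X = StandardScaler().fit_transform(X)  # SGD needs roughly unit-scaled features

clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=0)
clf.fit(X, y)  # a single pass per epoch over the data: very fast
print(clf.score(X, y))
```

Training scales linearly in the number of samples, which is why it stays fast well past the point where kernel SVMs bog down.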