mdoijade/cuml (public repository)

Forked from rapidsai/cuml

cuML - RAPIDS Machine Learning Library

License

Apache-2.0 license

Folders and files

.conda
.github
ci
conda
cpp
docs
img
notebooks
python
thirdparty/LICENSES
wiki
.gitattributes
.gitignore
BUILD.md
CHANGELOG.md
CONTRIBUTING.md
Dockerfile
LICENSE
MANIFEST.in
README.md
build.sh
codecov.yml
print_env.sh
readthedocs.yml

cuML - GPU Machine Learning Algorithms

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions that share compatible APIs with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.
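Because the estimator names and constructor arguments largely match, the same script can often target either library. The sketch below is one hedged way to do that: `load_estimator` is a hypothetical helper, not part of cuML or scikit-learn, and it assumes the estimator lives at the same submodule path in both packages (true for `cluster.DBSCAN`, not guaranteed for every estimator).

```python
import importlib


def load_estimator(class_name, submodule, backends=("cuml", "sklearn")):
    """Return (backend_name, estimator_class) from the first importable backend.

    Hypothetical helper for illustration. Returns (None, None) when no
    backend provides the requested class.
    """
    for backend in backends:
        try:
            module = importlib.import_module(f"{backend}.{submodule}")
        except ImportError:
            continue  # backend not installed; try the next one
        estimator = getattr(module, class_name, None)
        if estimator is not None:
            return backend, estimator
    return None, None


backend, DBSCAN = load_estimator("DBSCAN", "cluster")
if DBSCAN is None:
    print("Neither cuml nor scikit-learn is installed")
else:
    # Both libraries accept these constructor arguments for DBSCAN.
    model = DBSCAN(eps=1.0, min_samples=1)
    print(f"Using {backend}: {type(model).__name__}")
```

The GPU backend is tried first, so the CPU implementation is only used when cuML cannot be imported. Input and output types can still differ between the two libraries, so code that inspects results may need small adjustments.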

For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.

As an example, the following Python snippet loads input and computes DBSCAN clusters, all on GPU, using cuDF:

    import cudf
    from cuml.cluster import DBSCAN

    # Create and populate a GPU DataFrame
    gdf_float = cudf.DataFrame()
    gdf_float['0'] = [1.0, 2.0, 5.0]
    gdf_float['1'] = [4.0, 2.0, 1.0]
    gdf_float['2'] = [4.0, 2.0, 1.0]

    # Setup and fit clusters
    dbscan_float = DBSCAN(eps=1.0, min_samples=1)
    dbscan_float.fit(gdf_float)

    print(dbscan_float.labels_)

Output:

    0    0
    1    1
    2    2
    dtype: int32
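To see why every row ends up in its own cluster, it can help to run the same data through a small CPU reference. The following is a minimal pure-Python sketch of the textbook DBSCAN procedure, not cuML's GPU implementation; the function name `dbscan` and the `rows` variable are illustrative, and the sketch assumes the scikit-learn convention that a point counts toward its own `min_samples`.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself (assumed convention).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


# Rows of the DataFrame above: columns '0', '1', '2' read across.
rows = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
print(dbscan(rows, eps=1.0, min_samples=1))  # [0, 1, 2]
```

The closest pair of rows is 3.0 apart, which is larger than `eps=1.0`, and `min_samples=1` makes every point a core point on its own, so each row forms a separate cluster.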

cuML also features multi-GPU and multi-node-multi-GPU operation, using Dask, for a growing list of algorithms. The following Python snippet reads input from a CSV file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node:

Initialize a LocalCUDACluster configured with UCX for fast transport of CUDA arrays:

    # Initialize UCX for high-speed transport of CUDA arrays
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster(protocol="ucx",
                               enable_tcp_over_ucx=True,
                               enable_nvlink=True,
                               enable_infiniband=False)

Load data and perform k-Nearest Neighbors search. cuml.dask estimators also support Dask.Array as input:

    from dask.distributed import Client
    client = Client(cluster)

    # Read CSV file in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv("/path/to/csv")

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors=10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)
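For readers unfamiliar with what a nearest-neighbors query computes, here is a small brute-force CPU sketch. It is an illustration only: the function `kneighbors` below is a hypothetical stand-in written in plain Python, it does not use Dask, UCX, or Faiss, and the assumption that results come back as a (distances, indices) pair follows the scikit-learn convention rather than anything stated above.

```python
import heapq
import math


def kneighbors(index_points, query_points, n_neighbors):
    """Return (distances, indices), one row per query, nearest first."""
    all_distances, all_indices = [], []
    for query in query_points:
        # Score every indexed point against this query (brute force).
        scored = ((math.dist(query, p), i) for i, p in enumerate(index_points))
        nearest = heapq.nsmallest(n_neighbors, scored)
        all_distances.append([d for d, _ in nearest])
        all_indices.append([i for _, i in nearest])
    return all_distances, all_indices


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
distances, indices = kneighbors(points, points, n_neighbors=2)
print(indices)  # [[0, 1], [1, 0], [2, 0], [3, 2]]
```

When the query set is the same as the indexed set, as in the snippet above, each point's first neighbor is itself at distance 0. Brute force costs one distance computation per (query, point) pair, which is the work the GPU and multi-worker versions spread across devices.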

For additional examples, browse our complete API documentation, or check out our example walkthrough notebooks. Finally, you can find complete end-to-end examples in the notebooks-contrib repo.

Supported Algorithms

Category	Algorithm	Notes
Clustering	Density-Based Spatial Clustering of Applications with Noise (DBSCAN)	Multi-node multi-GPU via Dask
	K-Means	Multi-node multi-GPU via Dask
Dimensionality Reduction	Principal Components Analysis (PCA)	Multi-node multi-GPU via Dask
	Incremental PCA
	Truncated Singular Value Decomposition (tSVD)	Multi-node multi-GPU via Dask
	Uniform Manifold Approximation and Projection (UMAP)	Multi-node multi-GPU Inference via Dask
	Random Projection
	t-Distributed Stochastic Neighbor Embedding (TSNE)
Linear Models for Regression or Classification	Linear Regression (OLS)	Multi-node multi-GPU via Dask
	Linear Regression with Lasso or Ridge Regularization	Multi-node multi-GPU via Dask
	ElasticNet Regression
	LARS Regression	(experimental)
	Logistic Regression	Multi-node multi-GPU via Dask-GLM demo
	Naive Bayes	Multi-node multi-GPU via Dask
	Stochastic Gradient Descent (SGD), Coordinate Descent (CD), and Quasi-Newton (QN) (including L-BFGS and OWL-QN) solvers for linear models
Nonlinear Models for Regression or Classification	Random Forest (RF) Classification	Experimental multi-node multi-GPU via Dask
	Random Forest (RF) Regression	Experimental multi-node multi-GPU via Dask
	Inference for decision tree-based models	Forest Inference Library (FIL)
	K-Nearest Neighbors (KNN) Classification	Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query.
	K-Nearest Neighbors (KNN) Regression	Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query.
	Support Vector Machine Classifier (SVC)
	Epsilon-Support Vector Regression (SVR)
Time Series	Holt-Winters Exponential Smoothing
	Auto-regressive Integrated Moving Average (ARIMA)	Supports seasonality (SARIMA)
Model Explanation	SHAP Kernel Explainer	Based on SHAP (experimental)
	SHAP Permutation Explainer	Based on SHAP (experimental)
Other	K-Nearest Neighbors (KNN) Search	Multi-node multi-GPU via Dask+UCX, uses Faiss for Nearest Neighbors Query.

Installation

See the RAPIDS Release Selector for the command line to install either nightly or official release cuML packages via Conda or Docker.

Build/Install from Source

See the build guide.

Contributing

Please see our guide for contributing to cuML.

References

The RAPIDS team has a number of blogs with deeper technical dives and examples. You can find them here on Medium.

For additional details on the technologies behind cuML, as well as a broader overview of the Python Machine Learning landscape, see Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence (2020) by Sebastian Raschka, Joshua Patterson, and Corey Nolet.

Please consider citing this when using cuML in a project. You can use the citation BibTeX:

    @article{raschka2020machine,
      title={Machine Learning in Python: Main developments and technology trends in data science, machine learning, and artificial intelligence},
      author={Raschka, Sebastian and Patterson, Joshua and Nolet, Corey},
      journal={arXiv preprint arXiv:2002.04803},
      year={2020}
    }

Contact

Find out more details on the RAPIDS site.

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, while exposing that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
