Usable in Java, Scala, Python, and R.
MLlib fits into Spark's APIs and interoperates with NumPy in Python (as of Spark 0.9) and R libraries (as of Spark 1.5). You can use any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
High-quality algorithms, 100x faster than MapReduce.
Spark excels at iterative computation, enabling MLlib to run fast. At the same time, we care about algorithmic performance: MLlib contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce.
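To make the point about iteration concrete, here is a minimal sketch in plain Python. It is not MLlib code; the data, learning rate, and function names are assumptions chosen for illustration. It fits a one-parameter linear model by gradient descent and compares the result after a single pass over the data with the result after twenty passes.

```python
def mean_squared_error(w, xs, ys):
    """Average squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, passes, learning_rate=0.05):
    """Fit y = w * x, making one full scan of the data per pass."""
    w = 0.0
    for _ in range(passes):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


# Toy data generated by y = 2 * x, so the ideal weight is 2.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

one_pass = gradient_descent(xs, ys, passes=1)
many_passes = gradient_descent(xs, ys, passes=20)

error_one_pass = mean_squared_error(one_pass, xs, ys)
error_many_passes = mean_squared_error(many_passes, xs, ys)
```

Each pass is a full scan of the dataset. On a system that rereads its input from disk for every scan, running many passes is expensive, which is why one-pass approximations are tempting there. When the data can be kept in memory between passes, as Spark allows, the additional passes are cheap and the model gets much closer to the ideal weight.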
Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.
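The deployment modes above differ mainly in the master URL passed to Spark. The sketch below builds a `spark-submit` command line for each mode. The host name, the application file `train_model.py`, and the port numbers are assumptions for illustration (7077 and 5050 are the usual defaults for standalone and Mesos masters; the Kubernetes API server port varies by cluster), and a real Kubernetes or YARN submission typically needs further options.

```python
# Master URL formats for each cluster manager.
# Ports are assumed typical values and may differ on a given cluster.
MASTER_URL_TEMPLATES = {
    "local": "local[*]",                        # all cores on one machine
    "standalone": "spark://{host}:7077",        # Spark standalone cluster
    "yarn": "yarn",                             # resolved from Hadoop config
    "mesos": "mesos://{host}:5050",             # Apache Mesos
    "kubernetes": "k8s://https://{host}:6443",  # Kubernetes API server
}


def master_url(mode, host="cluster.example.com"):
    """Return the master URL for a deployment mode (host is hypothetical)."""
    return MASTER_URL_TEMPLATES[mode].format(host=host)


def spark_submit_command(mode, app="train_model.py", host="cluster.example.com"):
    """Build a spark-submit argument list for a hypothetical application."""
    return ["spark-submit", "--master", master_url(mode, host), app]


commands = {mode: spark_submit_command(mode) for mode in MASTER_URL_TEMPLATES}
```

Because only the master URL changes, the same MLlib application can usually move between a laptop and a cluster without code changes.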
MLlib contains many algorithms and utilities.
ML algorithms include:
ML workflow utilities include:
Other utilities include:
Refer to the MLlib guide for usage examples.
MLlib is developed as part of the Apache Spark project. It thus gets tested and updated with each Spark release.
If you have questions about the library, ask on the Spark mailing lists.
MLlib is still a rapidly growing project and welcomes contributions. If you'd like to submit an algorithm to MLlib, read how to contribute to Spark and send us a patch!
To get started with MLlib: