MLlib (RDD-based)#
Classification#
| Classification model trained using Multinomial/Binary Logistic Regression. |
Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent. | |
Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS. | |
| Model for Support Vector Machines (SVMs). |
Train a Support Vector Machine (SVM) using Stochastic Gradient Descent. | |
| Model for Naive Bayes classifiers. |
Train a Multinomial Naive Bayes model. | |
Train or predict a logistic regression model on streaming data. |
Clustering#
| A clustering model derived from the bisecting k-means method. |
A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. | |
| A clustering model derived from the k-means method. |
| K-means clustering. |
| A clustering model derived from the Gaussian Mixture Model method. |
Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm. | |
| Model produced by |
Power Iteration Clustering (PIC), a scalable graph clustering algorithm. | |
| Provides methods to set k, decayFactor, and timeUnit to configure the KMeans algorithm for fitting and predicting on incoming DStreams. |
| Clustering model which can perform an online update of the centroids. |
| Train Latent Dirichlet Allocation (LDA) model. |
| A clustering model derived from the LDA method. |
Evaluation#
| Evaluator for binary classification. |
| Evaluator for regression. |
| Evaluator for multiclass classification. |
| Evaluator for ranking algorithms. |
Feature#
| Normalizes samples individually to unit Lp norm. |
| Represents a StandardScaler model that can transform vectors. |
| Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. |
| Maps a sequence of terms to their term frequencies using the hashing trick. |
| Represents an IDF model that can transform term frequency vectors. |
| Inverse document frequency (IDF). |
| Word2Vec creates vector representations of words in a text corpus. |
| Class for the Word2Vec model. |
| Creates a chi-squared feature selector. |
| Represents a chi-squared selector model. |
| Scales each column of the vector by the supplied weight vector. |
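The hashing trick and Lp normalization described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not MLlib's implementation: `hashing_tf` and `normalize` are hypothetical helper names, CRC32 is used only as a deterministic stand-in hash, and returning zero vectors unchanged is a choice made for this sketch.

```python
import math
import zlib


def hashing_tf(terms, num_features=16):
    """Map a sequence of terms to a fixed-length term-frequency vector."""
    vector = [0.0] * num_features
    for term in terms:
        # CRC32 is an assumption made for this sketch (a deterministic
        # stand-in); it is not claimed to be the hash MLlib uses.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


def normalize(vector, p=2.0):
    """Scale a vector to unit Lp norm; zero vectors are returned unchanged."""
    if math.isinf(p):
        norm = max(abs(x) for x in vector)
    else:
        norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


# Term frequencies for a tiny document, then L2 normalization.
tf = hashing_tf(["spark", "mllib", "spark", "rdd"])
unit = normalize(tf, p=2.0)
```

Because the output length is fixed by `num_features`, distinct terms may collide in the same slot; a larger feature dimension makes collisions less likely at the cost of longer vectors.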
Frequency Pattern Mining#
| A parallel FP-growth algorithm to mine frequent itemsets. |
| An FP-growth model for mining frequent itemsets using the parallel FP-growth algorithm. |
A parallel PrefixSpan algorithm to mine frequent sequential patterns. | |
| Model fitted by PrefixSpan. |
Vector and Matrix#
| |
| A dense vector represented by a value array. |
| A simple sparse vector class for passing data to MLlib. |
| Factory methods for working with vectors. |
| |
| Column-major dense matrix. |
| Sparse Matrix stored in CSC format. |
| |
| Represents QR factors. |
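To make the two local matrix layouts above concrete, here is a minimal plain-Python sketch that expands column-major and CSC (compressed sparse column) storage into rows. The helper names `column_major_to_rows` and `csc_to_dense` are hypothetical and exist only for illustration; they are not part of MLlib.

```python
def column_major_to_rows(num_rows, num_cols, values):
    """Expand a column-major value array into a list of rows."""
    # Entry (i, j) lives at position i + j * num_rows.
    return [
        [float(values[i + j * num_rows]) for j in range(num_cols)]
        for i in range(num_rows)
    ]


def csc_to_dense(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand CSC storage into a list of rows."""
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        # col_ptrs[col]:col_ptrs[col + 1] is the slice of row_indices and
        # values that holds the non-zero entries of this column.
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = float(values[k])
    return dense


# A 3x2 matrix stored column by column.
dense_rows = column_major_to_rows(3, 2, [1, 2, 3, 4, 5, 6])
# dense_rows == [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

# A 3x2 sparse matrix with three non-zero entries.
sparse_rows = csc_to_dense(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
# sparse_rows == [[9.0, 0.0], [0.0, 8.0], [0.0, 6.0]]
```

CSC stores only the non-zero values plus two index arrays, so it pays off when most entries are zero.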
Distributed Representation#
| Represents a distributed matrix in blocks of local matrices. |
| Represents a matrix in coordinate format. |
Represents a distributively stored matrix backed by one or more RDDs. | |
| Represents a row of an IndexedRowMatrix. |
| Represents a row-oriented distributed Matrix with indexed rows. |
| Represents an entry of a CoordinateMatrix. |
| Represents a row-oriented distributed Matrix with no meaningful row indices. |
| Represents singular value decomposition (SVD) factors. |
Random#
Generator methods for creating RDDs composed of i.i.d. samples from some distribution. |
Recommendation#
| A matrix factorization model trained by regularized alternating least squares. |
| Alternating Least Squares matrix factorization. |
| Represents a (user, product, rating) tuple. |
Regression#
| Class that represents the features and labels of a data point. |
| A linear model that has a vector of coefficients and an intercept. |
| A linear regression model derived from a least-squares fit. |
Train a linear regression model with no regularization using Stochastic Gradient Descent. | |
| A linear regression model derived from a least-squares fit with an L2 penalty term. |
Train a regression model with L2-regularization using Stochastic Gradient Descent. | |
| A linear regression model derived from a least-squares fit with an L1 penalty term. |
Train a regression model with L1-regularization using Stochastic Gradient Descent. | |
| Regression model for isotonic regression. |
Isotonic regression. | |
| Base class that has to be inherited by any StreamingLinearAlgorithm. |
| Train or predict a linear regression model on streaming data. |
Statistics#
| Trait for multivariate statistical summary of a data matrix. |
| Contains test results for the chi-squared hypothesis test. |
| Represents a (mu, sigma) tuple. |
Estimate probability density at required points given an RDD of samples from the population. | |
| Contains test results for the Kolmogorov-Smirnov test. |
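The density estimation entry above can be illustrated with a small plain-Python sketch that averages Gaussian kernels centred on each sample. The Gaussian kernel, the function name `gaussian_kde`, and the `bandwidth` parameter are assumptions made for this example, and it operates on a local list rather than an RDD.

```python
import math


def gaussian_kde(samples, points, bandwidth=1.0):
    """Estimate the density at each point from a list of samples."""
    # Normalizing constant shared by every Gaussian kernel.
    scale = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    estimates = []
    for x in points:
        total = sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )
        estimates.append(scale * total)
    return estimates


# Density estimates at three points, from four samples.
densities = gaussian_kde([1.0, 2.0, 2.5, 4.0], [0.0, 2.0, 5.0], bandwidth=0.8)
```

A smaller bandwidth gives a spikier estimate that follows individual samples closely, while a larger one smooths them together.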
Tree#
| A decision tree model for classification or regression. |
Learning algorithm for a decision tree model for classification or regression. | |
| Represents a random forest model. |
Learning algorithm for a random forest model for classification or regression. | |
| Represents a gradient-boosted tree model. |
Learning algorithm for a gradient boosted trees model for classification or regression. |
Utilities#
Mixin for classes which can load saved models using their Scala implementation. | |
Mixin for models that provide save() through their Scala implementation. | |
Utils for generating linear data. | |
| Mixin for classes which can load saved models from files. |
| Helper methods to load, save and pre-process data used in MLlib. |
| Mixin for models and transformers which may be saved as files. |