MLlib (DataFrame-based)
Note
From Apache Spark 4.0.0, all builtin algorithms support Spark Connect.
Pipeline APIs
| Abstract class for transformers that transform one dataset into another. |
| Abstract class for transformers that take one input column, apply a transformation, and output the result as a new column. |
| Abstract class for estimators that fit models to data. |
| Abstract class for models that are fitted by estimators. |
| Estimator for prediction tasks (regression and classification). |
| Model for prediction tasks (regression and classification). |
| A simple pipeline, which acts as an estimator. |
| Represents a compiled pipeline with transformers and fitted models. |
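The relationship between the rows above (transformers, estimators, fitted models, and pipelines) can be sketched in plain Python. This is a minimal illustration of the pattern only, not the PySpark API: datasets are lists of dicts rather than DataFrames, and `AddOne`, `MeanCenterer`, and `MeanCentererModel` are hypothetical stages invented for the example.

```python
from typing import Dict, List

Row = Dict[str, float]  # one record: column name -> value (stand-in for a DataFrame row)


class Transformer:
    """Transforms one dataset into another."""

    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Fits a model (itself a Transformer) to a dataset."""

    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class AddOne(Transformer):
    """Hypothetical unary transformer: reads one column, writes a new one."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] + 1.0} for r in rows]


class MeanCentererModel(Transformer):
    """Hypothetical fitted model: subtracts the mean learned during fit."""

    def __init__(self, input_col: str, output_col: str, mean: float) -> None:
        self.input_col, self.output_col, self.mean = input_col, output_col, mean

    def transform(self, rows: List[Row]) -> List[Row]:
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


class MeanCenterer(Estimator):
    """Hypothetical estimator: learns the column mean from the training data."""

    def __init__(self, input_col: str, output_col: str) -> None:
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows: List[Row]) -> MeanCentererModel:
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCentererModel(self.input_col, self.output_col, mean)


class PipelineModel(Transformer):
    """A compiled pipeline: every stage is a transformer or a fitted model."""

    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = stages

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """A pipeline acts as an estimator: fitting it fits each estimator stage in order."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, rows: List[Row]) -> PipelineModel:
        fitted: List[Transformer] = []
        current = rows
        for stage in self.stages:
            # Estimators are fitted on the data produced by the earlier stages.
            model = stage.fit(current) if isinstance(stage, Estimator) else stage
            fitted.append(model)
            current = model.transform(current)
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 3.0}]
pipeline_model = Pipeline([AddOne("x", "y"), MeanCenterer("y", "z")]).fit(train)
result = pipeline_model.transform(train)
```

Here `fit` returns a `PipelineModel` whose stages are all transformers, so the same object can later be applied to new data without refitting.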
Parameters
| A param with self-contained documentation. |
| Components that take parameters. |
| Factory methods for common type conversion functions for Param.typeConverter. |
Feature
| Binarize a column of continuous features given a threshold. |
| LSH class for Euclidean distance metrics. |
| Model fitted by |
| Maps a column of continuous features to a column of feature buckets. |
| Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. |
| Model fitted by |
| Extracts a vocabulary from document collections and generates a |
| Model fitted by |
| A feature transformer that takes the 1D discrete cosine transform of a real vector. |
| Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided "weight" vector. |
| Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). |
| Maps a sequence of terms to their term frequencies using the hashing trick. |
| Compute the Inverse Document Frequency (IDF) given a collection of documents. |
| Model fitted by |
| Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located. |
| Model fitted by |
| A |
| Implements the feature interaction transform. |
| Rescale each feature individually to the range [-1, 1] by dividing by the largest absolute value in each feature. |
| Model fitted by |
| LSH class for Jaccard distance. |
| Model produced by |
| Rescale each feature individually to a common range [min, max] linearly using column summary statistics; this is also known as min-max normalization or rescaling. |
| Model fitted by |
| A feature transformer that converts the input array of strings into an array of n-grams. |
| Normalize a vector to have unit norm using the given p-norm. |
| A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. |
| Model fitted by |
| PCA trains a model to project vectors to a lower dimensional space of the top |
| Model fitted by |
| Perform feature expansion in a polynomial space. |
|
|
| RobustScaler removes the median and scales the data according to the quantile range. |
| Model fitted by |
| A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). |
| Implements the transforms required for fitting a dataset against an R model formula. |
| Model fitted by |
| Implements the transforms defined by a SQL statement. |
| Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. |
| Model fitted by |
| A feature transformer that filters out stop words from input. |
| A label indexer that maps a string column of labels to an ML column of label indices. |
| Model fitted by |
| Target Encoding maps a column of categorical indices into a numerical feature derived from the target. |
| Model fitted by |
| A tokenizer that converts the input string to lowercase and then splits it on whitespace. |
| Feature selector based on univariate statistical tests against labels. |
| Model fitted by |
| Feature selector that removes all low-variance features. |
| Model fitted by |
| A feature transformer that merges multiple columns into a vector column. |
| Class for indexing categorical feature columns in a dataset of Vector. |
| Model fitted by |
| A feature transformer that adds size information to the metadata of a vector column. |
| This class takes a feature vector and outputs a new feature vector with a subarray of the original features. |
| Word2Vec trains a model of Map(String, Vector), i.e. transforms a word into a code for further natural language processing or machine learning processes. |
| Model fitted by |
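Several of the feature transformers listed above perform simple, well-defined computations. The sketch below restates three of them (thresholding, n-gram generation, and min-max rescaling) in plain Python to make the descriptions concrete. The function names are hypothetical and this is not the PySpark implementation. Two behaviours are assumptions of the sketch: values equal to the threshold map to 0.0, and a constant column is mapped to the midpoint of the target range.

```python
from typing import List, Sequence


def binarize(values: Sequence[float], threshold: float) -> List[float]:
    """Map each continuous value to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Convert a list of strings into space-joined n-grams.

    Inputs shorter than n produce an empty list.
    """
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def min_max_rescale(column: Sequence[float], lo: float = 0.0, hi: float = 1.0) -> List[float]:
    """Linearly rescale one feature column to [lo, hi] using its min and max."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:
        # Assumption: a constant column has no spread, so use the midpoint.
        return [0.5 * (lo + hi) for _ in column]
    scale = (hi - lo) / (c_max - c_min)
    return [(v - c_min) * scale + lo for v in column]


binarized = binarize([0.1, 0.5, 0.9], threshold=0.5)
bigrams = ngrams(["spark", "ml", "feature", "api"], n=2)
rescaled = min_max_rescale([1.0, 2.0, 3.0])
```

In this sketch the min-max statistics are computed and applied in one call. The estimator and model rows above separate those two steps, so that the statistics learned from training data can be reused on new data.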
Classification
| This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. |
| Model fitted by LinearSVC. |
| Abstraction for LinearSVC Results for a given model. |
| Abstraction for LinearSVC Training results. |
| Logistic regression. |
| Model fitted by LogisticRegression. |
| Abstraction for Logistic Regression Results for a given model. |
| Abstraction for multinomial Logistic Regression Training results. |
| Binary Logistic regression results for a given model. |
| Binary Logistic regression training results for a given model. |
| Decision tree learning algorithm for classification. |
| Model fitted by DecisionTreeClassifier. |
| Gradient-Boosted Trees (GBTs) learning algorithm for classification. |
| Model fitted by GBTClassifier. |
| Random Forest learning algorithm for classification. |
| Model fitted by RandomForestClassifier. |
| Abstraction for RandomForestClassification Results for a given model. |
| Abstraction for RandomForestClassification training results. |
| BinaryRandomForestClassification results for a given model. |
| BinaryRandomForestClassification training results for a given model. |
| Naive Bayes Classifiers. |
| Model fitted by NaiveBayes. |
| Classifier trainer based on the Multilayer Perceptron. |
| Model fitted by MultilayerPerceptronClassifier. |
| Abstraction for MultilayerPerceptronClassifier Results for a given model. |
| Abstraction for MultilayerPerceptronClassifier Training results. |
| Reduction of Multiclass Classification to Binary Classification. |
| Model fitted by OneVsRest. |
| Factorization Machines learning algorithm for classification. |
| Model fitted by |
| Abstraction for FMClassifier Results for a given model. |
| Abstraction for FMClassifier Training results. |
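The linear SVM row above refers to the hinge loss. As a point of reference, the following plain-Python sketch evaluates that loss for a linear decision function. It does not implement the OWLQN optimizer or any PySpark class. The function names are hypothetical, and labels are assumed to be encoded as -1 and +1 for this sketch.

```python
from typing import Sequence


def hinge_loss(weights: Sequence[float], bias: float,
               features: Sequence[float], label: int) -> float:
    """Hinge loss max(0, 1 - y * (w . x + b)) for one example, with y in {-1, +1}."""
    margin = sum(w * x for w, x in zip(weights, features)) + bias
    return max(0.0, 1.0 - label * margin)


def mean_hinge_loss(weights: Sequence[float], bias: float,
                    dataset: Sequence[tuple]) -> float:
    """Average hinge loss over (features, label) pairs."""
    return sum(hinge_loss(weights, bias, x, y) for x, y in dataset) / len(dataset)


w, b = [1.0, -1.0], 0.0
loss_confident = hinge_loss(w, b, [2.0, 0.0], +1)   # margin 2.0, beyond the margin boundary
loss_inside = hinge_loss(w, b, [0.5, 0.0], +1)      # margin 0.5, inside the margin
loss_wrong = hinge_loss(w, b, [2.0, 0.0], -1)       # misclassified
```

The loss is zero once an example is classified correctly with a margin of at least 1, and grows linearly otherwise.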
Clustering
| A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark. |
| Model fitted by BisectingKMeans. |
| Bisecting KMeans clustering results for a given model. |
| K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al). |
| Model fitted by KMeans. |
| Summary of KMeans. |
| GaussianMixture clustering. |
| Model fitted by GaussianMixture. |
| Gaussian mixture clustering results for a given model. |
| Latent Dirichlet Allocation (LDA), a topic model designed for text documents. |
| Latent Dirichlet Allocation (LDA) model. |
| Local (non-distributed) model fitted by |
| Distributed model fitted by |
| Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. |
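For orientation, the core loop of k-means (assign each point to its nearest center, then recompute the centers) can be sketched in plain Python. This sketch deliberately omits the k-means|| initialization mentioned above and takes the initial centers as an argument instead. It is a single-machine illustration with hypothetical function names, not the PySpark implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centers: Sequence[Point]) -> List[int]:
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points]


def update(points: Sequence[Point], labels: Sequence[int],
           centers: Sequence[Point]) -> List[Point]:
    """Recompute each center as the mean of its members (empty clusters keep their center)."""
    new_centers: List[Point] = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def lloyd(points: Sequence[Point], initial_centers: Sequence[Point],
          max_iter: int = 20) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and update steps until the centers stop changing."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        new_centers = update(points, assign(points, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
final_centers, final_labels = lloyd(data, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
```

Because the result depends on the starting centers, a careful initialization scheme matters in practice; the sketch leaves that choice to the caller.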
Functions
| Converts a column of arrays of numeric type into a column of pyspark.ml.linalg.DenseVector instances. |
| Converts a column of MLlib sparse/dense vectors into a column of dense arrays. |
| Given a function which loads a model and returns a predict function for inference over a batch of numpy inputs, returns a Pandas UDF wrapper for inference over a Spark DataFrame. |
Vector and Matrix
| |
| A dense vector represented by a value array. |
| A simple sparse vector class for passing data to MLlib. |
| Factory methods for working with vectors. |
| |
| Column-major dense matrix. |
| Sparse Matrix stored in CSC format. |
|
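The two matrix layouts named above, column-major dense storage and CSC (compressed sparse column) storage, can be illustrated with a small plain-Python conversion. The function and argument names are hypothetical and the sketch does not use the `pyspark.ml.linalg` classes. It only shows how the three CSC arrays (column pointers, row indices, values) encode the non-zero entries.

```python
from typing import List, Sequence


def csc_to_dense(num_rows: int, num_cols: int, col_ptrs: Sequence[int],
                 row_indices: Sequence[int], values: Sequence[float]) -> List[List[float]]:
    """Expand a CSC-encoded matrix into a list of rows.

    The entries of column j are values[col_ptrs[j]:col_ptrs[j + 1]], and
    row_indices gives the row of each of those entries.
    """
    dense = [[0.0] * num_cols for _ in range(num_rows)]
    for col in range(num_cols):
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            dense[row_indices[k]][col] = values[k]
    return dense


def to_column_major(dense: Sequence[Sequence[float]]) -> List[float]:
    """Flatten a matrix column by column, as in column-major dense storage."""
    num_cols = len(dense[0])
    return [row[col] for col in range(num_cols) for row in dense]


# A 3 x 2 matrix with three non-zero entries.
matrix = csc_to_dense(3, 2, col_ptrs=[0, 1, 3], row_indices=[0, 2, 1], values=[9.0, 6.0, 8.0])
flat = to_column_major(matrix)
```

The column-pointer array always has one more entry than the number of columns, and its last entry equals the number of stored values.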
Recommendation
| Alternating Least Squares (ALS) matrix factorization. |
| Model fitted by ALS. |
Regression
| Accelerated Failure Time (AFT) Model Survival Regression |
| Model fitted by |
| Decision tree learning algorithm for regression. |
| Model fitted by |
| Gradient-Boosted Trees (GBTs) learning algorithm for regression. |
| Model fitted by |
| Generalized Linear Regression. |
| Model fitted by |
| Generalized linear regression results evaluated on a dataset. |
| Generalized linear regression training results. |
| Currently implemented using parallelized pool adjacent violators algorithm. |
| Model fitted by |
| Linear regression. |
| Model fitted by |
| Linear regression results evaluated on a dataset. |
| Linear regression training results. |
| Random Forest learning algorithm for regression. |
| Model fitted by |
| Factorization Machines learning algorithm for regression. |
| Model fitted by |
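One row above mentions the pool adjacent violators algorithm, which fits a non-decreasing sequence to observed labels. The sketch below is a sequential, single-machine version in plain Python, not the parallelized variant the row describes. It assumes the labels are already ordered by their feature value, and the function name `pava` is hypothetical.

```python
from typing import List, Optional, Sequence


def pava(ys: Sequence[float], weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted least-squares fit of a non-decreasing sequence to ys.

    Adjacent blocks whose means violate the ordering are pooled into one
    block carrying their weighted mean.
    """
    if weights is None:
        weights = [1.0] * len(ys)
    blocks: List[list] = []  # each block is [mean, total_weight, number_of_points]
    for y, w in zip(ys, weights):
        blocks.append([y, w, 1])
        # Pool while the last two blocks are out of order.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            blocks.append([(mean1 * weight1 + mean2 * weight2) / total, total, count1 + count2])
    fitted: List[float] = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


fit_partial = pava([1.0, 3.0, 2.0, 4.0])   # only the middle pair violates the ordering
fit_reversed = pava([3.0, 2.0, 1.0])       # fully decreasing input collapses to its mean
```

Every pooled block takes the weighted mean of its members, which is why a strictly decreasing input ends up as a constant sequence.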
Statistics
| Conduct Pearson's independence test for every feature against the label. |
| Compute the correlation matrix for the input dataset of Vectors using the specified method. |
| Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution. |
| Represents a (mean, cov) tuple |
| Tools for vectorized statistics on MLlib Vectors. |
| A builder object that provides summary statistics about a given column. |
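To make the correlation-matrix row above concrete, here is a plain-Python sketch of the Pearson method applied to a set of feature columns. It is an illustration with hypothetical function names rather than the PySpark API, and returning NaN for a constant column is a choice made for this sketch.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        # Assumption: correlation is undefined for a constant column.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise Pearson correlations between feature columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


features = [
    [1.0, 2.0, 3.0],   # feature 0
    [2.0, 4.0, 6.0],   # feature 1: a positive multiple of feature 0
    [3.0, 2.0, 1.0],   # feature 2: decreases as feature 0 increases
]
corr = correlation_matrix(features)
```

The result is symmetric with ones on the diagonal whenever every column has non-zero variance.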
Tuning
| Builder for a param grid used in grid search-based model selection. |
| K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds that are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. |
| CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data. |
| Validation for hyper-parameter tuning. |
| Model from train validation split. |
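The two ideas behind the tuning rows above, enumerating a grid of parameter combinations and splitting data into k (training, test) pairs, can be sketched in plain Python. The function names are hypothetical and this is not how PySpark assigns rows to folds. The sketch shuffles indices with a fixed seed and deals them out so that the folds are as equal in size as possible.

```python
import itertools
import random
from typing import Dict, List, Sequence, Tuple


def param_grid(grid: Dict[str, Sequence]) -> List[Dict[str, object]]:
    """All combinations of the listed values, one dict per combination."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(grid[name] for name in names))]


def k_fold_splits(rows: Sequence, k: int, seed: int = 0) -> List[Tuple[list, list]]:
    """k non-overlapping (training, test) pairs; each fold is the test set once."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = [rows[j] for j in folds[i]]
        train = [rows[j] for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits


grid = param_grid({"regParam": [0.1, 0.01], "maxIter": [10, 20, 30]})
splits = k_fold_splits(list(range(9)), k=3)
```

A grid search evaluates every entry of `grid` on every pair in `splits`, so the cost grows with the product of the two lengths (here 6 combinations times 3 folds).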
Evaluation
| Base class for evaluators that compute metrics from predictions. |
| Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column. |
| Evaluator for Regression, which expects input columns prediction, label and an optional weight column. |
| Evaluator for Multiclass Classification, which expects input columns: prediction, label, weight (optional) and probabilityCol (only for logLoss). |
| Evaluator for Multilabel Classification, which expects two input columns: prediction and label. |
| Evaluator for Clustering results, which expects two input columns: prediction and features. |
| Evaluator for Ranking, which expects two input columns: prediction and label. |
Frequency Pattern Mining
| A parallel FP-growth algorithm to mine frequent itemsets. |
| Model fitted by FPGrowth. |
| A parallel PrefixSpan algorithm to mine frequent sequential patterns. |
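To show what "mining frequent itemsets" produces, the following plain-Python sketch counts every itemset by brute force and keeps those that meet a minimum support. It is not FP-growth: it enumerates all subsets of each transaction, which is only practical for tiny inputs, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Sequence, Tuple


def frequent_itemsets(transactions: Sequence[Sequence[str]],
                      min_support: float) -> Dict[Tuple[str, ...], int]:
    """Itemsets contained in at least a min_support fraction of the transactions.

    Keys are sorted tuples of items; values are the number of transactions
    that contain the itemset.
    """
    counts: Counter = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    total = len(transactions)
    return {itemset: count for itemset, count in counts.items()
            if count / total >= min_support}


baskets = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
frequent = frequent_itemsets(baskets, min_support=0.6)
```

FP-growth arrives at the same kind of result without enumerating candidate subsets explicitly, which is what makes it suitable for large datasets.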
Image
| Internal class for pyspark.ml.image.ImageSchema attribute. |
| Internal class for pyspark.ml.image.ImageSchema attribute. |
Distributor
| A class to support distributed training on PyTorch and PyTorch Lightning using PySpark. |
|
Utilities
| Base class for MLWriter and MLReader. |
| Helper trait for making simple |
| Specialization of |
| Helper trait for making simple |
| Specialization of |
| Utility class that can save ML instances in different formats. |
| Base class for models that provides Training summary. |
| Object with a unique ID. |
| Mixin for instances that provide |
| Utility class that can load ML instances. |
| Mixin for ML instances that provide |
| Utility class that can save ML instances. |