MLUtils#
- class pyspark.mllib.util.MLUtils[source]#
Helper methods to load, save and pre-process data used in MLlib.
New in version 1.0.0.
Methods
appendBias(data)  Returns a new vector with 1.0 (bias) appended to the end of the input vector.
convertMatrixColumnsFromML(dataset, *cols)  Converts matrix columns in an input DataFrame to the pyspark.mllib.linalg.Matrix type from the new pyspark.ml.linalg.Matrix type under the spark.ml package.
convertMatrixColumnsToML(dataset, *cols)  Converts matrix columns in an input DataFrame from the pyspark.mllib.linalg.Matrix type to the new pyspark.ml.linalg.Matrix type under the spark.ml package.
convertVectorColumnsFromML(dataset, *cols)  Converts vector columns in an input DataFrame to the pyspark.mllib.linalg.Vector type from the new pyspark.ml.linalg.Vector type under the spark.ml package.
convertVectorColumnsToML(dataset, *cols)  Converts vector columns in an input DataFrame from the pyspark.mllib.linalg.Vector type to the new pyspark.ml.linalg.Vector type under the spark.ml package.
loadLabeledPoints(sc, path[, minPartitions])  Load labeled points saved using RDD.saveAsTextFile.
loadLibSVMFile(sc, path[, numFeatures, ...])  Loads labeled data in the LIBSVM format into an RDD of LabeledPoint.
loadVectors(sc, path)  Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
saveAsLibSVMFile(data, dir)  Save labeled data in LIBSVM format.
Methods Documentation
- static appendBias(data)[source]#
Returns a new vector with 1.0 (bias) appended to the end of the input vector.
New in version 1.5.0.
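Examples
A minimal usage sketch (not part of the upstream docstring); the printed representation is assumed to follow the standard pyspark.mllib.linalg vector repr.
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> MLUtils.appendBias(Vectors.dense([1.0, 2.0]))  # bias term 1.0 appended as the last element
DenseVector([1.0, 2.0, 1.0])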
- static convertMatrixColumnsFromML(dataset, *cols)[source]#
Converts matrix columns in an input DataFrame to the pyspark.mllib.linalg.Matrix type from the new pyspark.ml.linalg.Matrix type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Matrix columns to be converted.
Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with new matrix columns converted to the old matrix type
Examples
>>> import pyspark
>>> from pyspark.ml.linalg import Matrices
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Matrices.sparse(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4]),
...       Matrices.dense(2, 2, range(4)))], ["id", "x", "y"])
>>> r1 = MLUtils.convertMatrixColumnsFromML(df).first()
>>> isinstance(r1.x, pyspark.mllib.linalg.SparseMatrix)
True
>>> isinstance(r1.y, pyspark.mllib.linalg.DenseMatrix)
True
>>> r2 = MLUtils.convertMatrixColumnsFromML(df, "x").first()
>>> isinstance(r2.x, pyspark.mllib.linalg.SparseMatrix)
True
>>> isinstance(r2.y, pyspark.ml.linalg.DenseMatrix)
True
- static convertMatrixColumnsToML(dataset, *cols)[source]#
Converts matrix columns in an input DataFrame from the pyspark.mllib.linalg.Matrix type to the new pyspark.ml.linalg.Matrix type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Matrix columns to be converted.
New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with old matrix columns converted to the new matrix type
Examples
>>> import pyspark
>>> from pyspark.mllib.linalg import Matrices
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Matrices.sparse(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4]),
...       Matrices.dense(2, 2, range(4)))], ["id", "x", "y"])
>>> r1 = MLUtils.convertMatrixColumnsToML(df).first()
>>> isinstance(r1.x, pyspark.ml.linalg.SparseMatrix)
True
>>> isinstance(r1.y, pyspark.ml.linalg.DenseMatrix)
True
>>> r2 = MLUtils.convertMatrixColumnsToML(df, "x").first()
>>> isinstance(r2.x, pyspark.ml.linalg.SparseMatrix)
True
>>> isinstance(r2.y, pyspark.mllib.linalg.DenseMatrix)
True
- static convertVectorColumnsFromML(dataset, *cols)[source]#
Converts vector columns in an input DataFrame to the pyspark.mllib.linalg.Vector type from the new pyspark.ml.linalg.Vector type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Vector columns to be converted.
Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with new vector columns converted to the old vector type
Examples
>>> import pyspark
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
...     ["id", "x", "y"])
>>> r1 = MLUtils.convertVectorColumnsFromML(df).first()
>>> isinstance(r1.x, pyspark.mllib.linalg.SparseVector)
True
>>> isinstance(r1.y, pyspark.mllib.linalg.DenseVector)
True
>>> r2 = MLUtils.convertVectorColumnsFromML(df, "x").first()
>>> isinstance(r2.x, pyspark.mllib.linalg.SparseVector)
True
>>> isinstance(r2.y, pyspark.ml.linalg.DenseVector)
True
- static convertVectorColumnsToML(dataset, *cols)[source]#
Converts vector columns in an input DataFrame from the pyspark.mllib.linalg.Vector type to the new pyspark.ml.linalg.Vector type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Vector columns to be converted.
New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with old vector columns converted to the new vector type
Examples
>>> import pyspark
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
...     ["id", "x", "y"])
>>> r1 = MLUtils.convertVectorColumnsToML(df).first()
>>> isinstance(r1.x, pyspark.ml.linalg.SparseVector)
True
>>> isinstance(r1.y, pyspark.ml.linalg.DenseVector)
True
>>> r2 = MLUtils.convertVectorColumnsToML(df, "x").first()
>>> isinstance(r2.x, pyspark.ml.linalg.SparseVector)
True
>>> isinstance(r2.y, pyspark.mllib.linalg.DenseVector)
True
- static loadLabeledPoints(sc, path, minPartitions=None)[source]#
Load labeled points saved using RDD.saveAsTextFile.
New in version 1.0.0.
- Parameters
- sc : pyspark.SparkContext
Spark context
- path : str
file or directory path in any Hadoop-supported file system URI
- minPartitions : int, optional
min number of partitions
- Returns
pyspark.RDD
labeled data stored as an RDD of LabeledPoint
Examples
>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.regression import LabeledPoint
>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])),
...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize(examples, 1).saveAsTextFile(tempFile.name)
>>> MLUtils.loadLabeledPoints(sc, tempFile.name).collect()
[LabeledPoint(1.1, (3,[0,2],[-1.23,4.56e-07])), LabeledPoint(0.0, [1.01,2.02,3.03])]
- static loadLibSVMFile(sc, path, numFeatures=-1, minPartitions=None)[source]#
Loads labeled data in the LIBSVM format into an RDD of LabeledPoint. The LIBSVM format is a text-based format used by LIBSVM and LIBLINEAR. Each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 …
where the indices are one-based and in ascending order. This method parses each line into a LabeledPoint, where the feature indices are converted to zero-based.
New in version 1.0.0.
- Parameters
- sc : pyspark.SparkContext
Spark context
- path : str
file or directory path in any Hadoop-supported file system URI
- numFeatures : int, optional
number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.
- minPartitions : int, optional
min number of partitions
- Returns
pyspark.RDD
labeled data stored as an RDD of LabeledPoint
Examples
>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.regression import LabeledPoint
>>> tempFile = NamedTemporaryFile(delete=True)
>>> _ = tempFile.write(b"+1 1:1.0 3:2.0 5:3.0\n-1\n-1 2:4.0 4:5.0 6:6.0")
>>> tempFile.flush()
>>> examples = MLUtils.loadLibSVMFile(sc, tempFile.name).collect()
>>> tempFile.close()
>>> examples[0]
LabeledPoint(1.0, (6,[0,2,4],[1.0,2.0,3.0]))
>>> examples[1]
LabeledPoint(-1.0, (6,[],[]))
>>> examples[2]
LabeledPoint(-1.0, (6,[1,3,5],[4.0,5.0,6.0]))
- static loadVectors(sc, path)[source]#
Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
New in version 1.5.0.
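Examples
A hedged round-trip sketch (not part of the upstream docstring), assuming an active SparkContext named sc as in the other examples; the printed vector repr is illustrative.
>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize([Vectors.dense([1.01, 2.02, 3.03])], 1).saveAsTextFile(tempFile.name)
>>> MLUtils.loadVectors(sc, tempFile.name).collect()
[DenseVector([1.01, 2.02, 3.03])]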
- static saveAsLibSVMFile(data, dir)[source]#
Save labeled data in LIBSVM format.
New in version 1.0.0.
- Parameters
- data : pyspark.RDD
an RDD of LabeledPoint to be saved
- dir : str
directory to save the data
Examples
>>> from tempfile import NamedTemporaryFile
>>> from fileinput import input
>>> from glob import glob
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.util import MLUtils
>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])),
...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> MLUtils.saveAsLibSVMFile(sc.parallelize(examples), tempFile.name)
>>> ''.join(sorted(input(glob(tempFile.name + "/part-0000*"))))
'0.0 1:1.01 2:2.02 3:3.03\n1.1 1:1.23 3:4.56\n'