MLUtils#
- class pyspark.mllib.util.MLUtils[source]#
Helper methods to load, save and pre-process data used in MLlib.
New in version 1.0.0.
Methods
appendBias(data)  Returns a new vector with 1.0 (bias) appended to the end of the input vector.
convertMatrixColumnsFromML(dataset, *cols)  Converts matrix columns in an input DataFrame to the pyspark.mllib.linalg.Matrix type from the new pyspark.ml.linalg.Matrix type under the spark.ml package.
convertMatrixColumnsToML(dataset, *cols)  Converts matrix columns in an input DataFrame from the pyspark.mllib.linalg.Matrix type to the new pyspark.ml.linalg.Matrix type under the spark.ml package.
convertVectorColumnsFromML(dataset, *cols)  Converts vector columns in an input DataFrame to the pyspark.mllib.linalg.Vector type from the new pyspark.ml.linalg.Vector type under the spark.ml package.
convertVectorColumnsToML(dataset, *cols)  Converts vector columns in an input DataFrame from the pyspark.mllib.linalg.Vector type to the new pyspark.ml.linalg.Vector type under the spark.ml package.
loadLabeledPoints(sc, path[, minPartitions])  Load labeled points saved using RDD.saveAsTextFile.
loadLibSVMFile(sc, path[, numFeatures, ...])  Loads labeled data in the LIBSVM format into an RDD of LabeledPoint.
loadVectors(sc, path)  Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
saveAsLibSVMFile(data, dir)  Save labeled data in LIBSVM format.
Methods Documentation
- static appendBias(data)[source]#
Returns a new vector with 1.0 (bias) appended to the end of the input vector.
New in version 1.5.0.
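Examples
A minimal usage sketch (not part of the upstream docstring); the printed representation is assumed to follow the standard pyspark.mllib.linalg vector repr.
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> MLUtils.appendBias(Vectors.dense([1.0, 2.0]))  # bias term 1.0 appended as the last element
DenseVector([1.0, 2.0, 1.0])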
- static convertMatrixColumnsFromML(dataset, *cols)[source]#
Converts matrix columns in an input DataFrame to the pyspark.mllib.linalg.Matrix type from the new pyspark.ml.linalg.Matrix type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Matrix columns to be converted.
Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with new matrix columns converted to the old matrix type
Examples
>>> import pyspark
>>> from pyspark.ml.linalg import Matrices
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Matrices.sparse(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4]),
...       Matrices.dense(2, 2, range(4)))], ["id", "x", "y"])
>>> r1 = MLUtils.convertMatrixColumnsFromML(df).first()
>>> isinstance(r1.x, pyspark.mllib.linalg.SparseMatrix)
True
>>> isinstance(r1.y, pyspark.mllib.linalg.DenseMatrix)
True
>>> r2 = MLUtils.convertMatrixColumnsFromML(df, "x").first()
>>> isinstance(r2.x, pyspark.mllib.linalg.SparseMatrix)
True
>>> isinstance(r2.y, pyspark.ml.linalg.DenseMatrix)
True
- static convertMatrixColumnsToML(dataset, *cols)[source]#
Converts matrix columns in an input DataFrame from the pyspark.mllib.linalg.Matrix type to the new pyspark.ml.linalg.Matrix type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Matrix columns to be converted.
New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with old matrix columns converted to the new matrix type
Examples
>>> import pyspark
>>> from pyspark.mllib.linalg import Matrices
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Matrices.sparse(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4]),
...       Matrices.dense(2, 2, range(4)))], ["id", "x", "y"])
>>> r1 = MLUtils.convertMatrixColumnsToML(df).first()
>>> isinstance(r1.x, pyspark.ml.linalg.SparseMatrix)
True
>>> isinstance(r1.y, pyspark.ml.linalg.DenseMatrix)
True
>>> r2 = MLUtils.convertMatrixColumnsToML(df, "x").first()
>>> isinstance(r2.x, pyspark.ml.linalg.SparseMatrix)
True
>>> isinstance(r2.y, pyspark.mllib.linalg.DenseMatrix)
True
- static convertVectorColumnsFromML(dataset, *cols)[source]#
Converts vector columns in an input DataFrame to the pyspark.mllib.linalg.Vector type from the new pyspark.ml.linalg.Vector type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Vector columns to be converted.
Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with new vector columns converted to the old vector type
Examples
>>> import pyspark
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
...     ["id", "x", "y"])
>>> r1 = MLUtils.convertVectorColumnsFromML(df).first()
>>> isinstance(r1.x, pyspark.mllib.linalg.SparseVector)
True
>>> isinstance(r1.y, pyspark.mllib.linalg.DenseVector)
True
>>> r2 = MLUtils.convertVectorColumnsFromML(df, "x").first()
>>> isinstance(r2.x, pyspark.mllib.linalg.SparseVector)
True
>>> isinstance(r2.y, pyspark.ml.linalg.DenseVector)
True
- static convertVectorColumnsToML(dataset, *cols)[source]#
Converts vector columns in an input DataFrame from the pyspark.mllib.linalg.Vector type to the new pyspark.ml.linalg.Vector type under the spark.ml package.
New in version 2.0.0.
- Parameters
- dataset : pyspark.sql.DataFrame
input dataset
- *cols : str
Vector columns to be converted.
New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.
- Returns
pyspark.sql.DataFrame
the input dataset with old vector columns converted to the new vector type
Examples
>>> import pyspark
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
...     ["id", "x", "y"])
>>> r1 = MLUtils.convertVectorColumnsToML(df).first()
>>> isinstance(r1.x, pyspark.ml.linalg.SparseVector)
True
>>> isinstance(r1.y, pyspark.ml.linalg.DenseVector)
True
>>> r2 = MLUtils.convertVectorColumnsToML(df, "x").first()
>>> isinstance(r2.x, pyspark.ml.linalg.SparseVector)
True
>>> isinstance(r2.y, pyspark.mllib.linalg.DenseVector)
True
- static loadLabeledPoints(sc, path, minPartitions=None)[source]#
Load labeled points saved using RDD.saveAsTextFile.
New in version 1.0.0.
- Parameters
- sc : pyspark.SparkContext
Spark context
- path : str
file or directory path in any Hadoop-supported file system URI
- minPartitions : int, optional
min number of partitions
- Returns
pyspark.RDD
labeled data stored as an RDD of LabeledPoint
Examples
>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.regression import LabeledPoint
>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])),
...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize(examples, 1).saveAsTextFile(tempFile.name)
>>> MLUtils.loadLabeledPoints(sc, tempFile.name).collect()
[LabeledPoint(1.1, (3,[0,2],[-1.23,4.56e-07])), LabeledPoint(0.0, [1.01,2.02,3.03])]
- static loadLibSVMFile(sc, path, numFeatures=-1, minPartitions=None)[source]#
Loads labeled data in the LIBSVM format into an RDD of LabeledPoint. The LIBSVM format is a text-based format used by LIBSVM and LIBLINEAR. Each line represents a labeled sparse feature vector using the following format:
label index1:value1 index2:value2 …
where the indices are one-based and in ascending order. This method parses each line into a LabeledPoint, where the feature indices are converted to zero-based.
New in version 1.0.0.
- Parameters
- sc : pyspark.SparkContext
Spark context
- path : str
file or directory path in any Hadoop-supported file system URI
- numFeatures : int, optional
number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.
- minPartitions : int, optional
min number of partitions
- Returns
pyspark.RDD
labeled data stored as an RDD of LabeledPoint
Examples
>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.regression import LabeledPoint
>>> tempFile = NamedTemporaryFile(delete=True)
>>> _ = tempFile.write(b"+1 1:1.0 3:2.0 5:3.0\n-1\n-1 2:4.0 4:5.0 6:6.0")
>>> tempFile.flush()
>>> examples = MLUtils.loadLibSVMFile(sc, tempFile.name).collect()
>>> tempFile.close()
>>> examples[0]
LabeledPoint(1.0, (6,[0,2,4],[1.0,2.0,3.0]))
>>> examples[1]
LabeledPoint(-1.0, (6,[],[]))
>>> examples[2]
LabeledPoint(-1.0, (6,[1,3,5],[4.0,5.0,6.0]))
- static loadVectors(sc, path)[source]#
Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.
New in version 1.5.0.
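Examples
A hedged round-trip sketch (not part of the upstream docstring), assuming an active SparkContext named sc as in the other examples; the printed vector repr is illustrative.
>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize([Vectors.dense([1.01, 2.02, 3.03])], 1).saveAsTextFile(tempFile.name)
>>> MLUtils.loadVectors(sc, tempFile.name).collect()
[DenseVector([1.01, 2.02, 3.03])]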
- static saveAsLibSVMFile(data, dir)[source]#
Save labeled data in LIBSVM format.
New in version 1.0.0.
- Parameters
- data : pyspark.RDD
an RDD of LabeledPoint to be saved
- dir : str
directory to save the data
Examples
>>> from tempfile import NamedTemporaryFile
>>> from fileinput import input
>>> from glob import glob
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.util import MLUtils
>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])),
...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> MLUtils.saveAsLibSVMFile(sc.parallelize(examples), tempFile.name)
>>> ''.join(sorted(input(glob(tempFile.name + "/part-0000*"))))
'0.0 1:1.01 2:2.02 3:3.03\n1.1 1:1.23 3:4.56\n'