
MLUtils

class pyspark.mllib.util.MLUtils

Helper methods to load, save and pre-process data used in MLlib.

New in version 1.0.0.
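
The doctest examples below assume an active SparkSession bound to spark and a SparkContext bound to sc. A minimal local setup sketch (the master and app name here are illustrative, not from the original page):

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.master("local[2]").appName("MLUtilsExamples").getOrCreate()
>>> sc = spark.sparkContext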

Methods

appendBias(data)

Returns a new vector with 1.0 (bias) appended to the end of the input vector.

convertMatrixColumnsFromML(dataset, *cols)

Converts matrix columns in an input DataFrame to the pyspark.mllib.linalg.Matrix type from the new pyspark.ml.linalg.Matrix type under the spark.ml package.

convertMatrixColumnsToML(dataset, *cols)

Converts matrix columns in an input DataFrame from the pyspark.mllib.linalg.Matrix type to the new pyspark.ml.linalg.Matrix type under the spark.ml package.

convertVectorColumnsFromML(dataset, *cols)

Converts vector columns in an input DataFrame to the pyspark.mllib.linalg.Vector type from the new pyspark.ml.linalg.Vector type under the spark.ml package.

convertVectorColumnsToML(dataset, *cols)

Converts vector columns in an input DataFrame from the pyspark.mllib.linalg.Vector type to the new pyspark.ml.linalg.Vector type under the spark.ml package.

loadLabeledPoints(sc, path[, minPartitions])

Load labeled points saved using RDD.saveAsTextFile.

loadLibSVMFile(sc, path[, numFeatures, ...])

Loads labeled data in the LIBSVM format into an RDD of LabeledPoint.

loadVectors(sc, path)

Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.

saveAsLibSVMFile(data, dir)

Save labeled data in LIBSVM format.

Methods Documentation

static appendBias(data)

Returns a new vector with 1.0 (bias) appended to the end of the input vector.

New in version 1.5.0.
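
The original page gives no example for this method; a minimal sketch (not from the source docs), assuming an mllib DenseVector input:

>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> MLUtils.appendBias(Vectors.dense([1.0, 2.0]))
DenseVector([1.0, 2.0, 1.0])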

static convertMatrixColumnsFromML(dataset, *cols)

Converts matrix columns in an input DataFrame to the pyspark.mllib.linalg.Matrix type from the new pyspark.ml.linalg.Matrix type under the spark.ml package.

New in version 2.0.0.

Parameters
dataset : pyspark.sql.DataFrame

input dataset

*cols : str

Matrix columns to be converted.

Old matrix columns will be ignored. If unspecified, all new matrix columns will be converted except nested ones.

Returns
pyspark.sql.DataFrame

the input dataset with new matrix columns converted to the old matrix type

Examples

>>> import pyspark
>>> from pyspark.ml.linalg import Matrices
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Matrices.sparse(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4]),
...     Matrices.dense(2, 2, range(4)))], ["id", "x", "y"])
>>> r1 = MLUtils.convertMatrixColumnsFromML(df).first()
>>> isinstance(r1.x, pyspark.mllib.linalg.SparseMatrix)
True
>>> isinstance(r1.y, pyspark.mllib.linalg.DenseMatrix)
True
>>> r2 = MLUtils.convertMatrixColumnsFromML(df, "x").first()
>>> isinstance(r2.x, pyspark.mllib.linalg.SparseMatrix)
True
>>> isinstance(r2.y, pyspark.ml.linalg.DenseMatrix)
True

static convertMatrixColumnsToML(dataset, *cols)

Converts matrix columns in an input DataFrame from the pyspark.mllib.linalg.Matrix type to the new pyspark.ml.linalg.Matrix type under the spark.ml package.

New in version 2.0.0.

Parameters
dataset : pyspark.sql.DataFrame

input dataset

*cols : str

Matrix columns to be converted.

New matrix columns will be ignored. If unspecified, all old matrix columns will be converted except nested ones.

Returns
pyspark.sql.DataFrame

the input dataset with old matrix columns converted to the new matrix type

Examples

>>> import pyspark
>>> from pyspark.mllib.linalg import Matrices
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Matrices.sparse(2, 2, [0, 2, 3], [0, 1, 1], [2, 3, 4]),
...     Matrices.dense(2, 2, range(4)))], ["id", "x", "y"])
>>> r1 = MLUtils.convertMatrixColumnsToML(df).first()
>>> isinstance(r1.x, pyspark.ml.linalg.SparseMatrix)
True
>>> isinstance(r1.y, pyspark.ml.linalg.DenseMatrix)
True
>>> r2 = MLUtils.convertMatrixColumnsToML(df, "x").first()
>>> isinstance(r2.x, pyspark.ml.linalg.SparseMatrix)
True
>>> isinstance(r2.y, pyspark.mllib.linalg.DenseMatrix)
True

static convertVectorColumnsFromML(dataset, *cols)

Converts vector columns in an input DataFrame to the pyspark.mllib.linalg.Vector type from the new pyspark.ml.linalg.Vector type under the spark.ml package.

New in version 2.0.0.

Parameters
dataset : pyspark.sql.DataFrame

input dataset

*cols : str

Vector columns to be converted.

Old vector columns will be ignored. If unspecified, all new vector columns will be converted except nested ones.

Returns
pyspark.sql.DataFrame

the input dataset with new vector columns converted to the old vector type

Examples

>>> import pyspark
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
...     ["id", "x", "y"])
>>> r1 = MLUtils.convertVectorColumnsFromML(df).first()
>>> isinstance(r1.x, pyspark.mllib.linalg.SparseVector)
True
>>> isinstance(r1.y, pyspark.mllib.linalg.DenseVector)
True
>>> r2 = MLUtils.convertVectorColumnsFromML(df, "x").first()
>>> isinstance(r2.x, pyspark.mllib.linalg.SparseVector)
True
>>> isinstance(r2.y, pyspark.ml.linalg.DenseVector)
True

static convertVectorColumnsToML(dataset, *cols)

Converts vector columns in an input DataFrame from the pyspark.mllib.linalg.Vector type to the new pyspark.ml.linalg.Vector type under the spark.ml package.

New in version 2.0.0.

Parameters
dataset : pyspark.sql.DataFrame

input dataset

*cols : str

Vector columns to be converted.

New vector columns will be ignored. If unspecified, all old vector columns will be converted except nested ones.

Returns
pyspark.sql.DataFrame

the input dataset with old vector columns converted to the new vector type

Examples

>>> import pyspark
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> df = spark.createDataFrame(
...     [(0, Vectors.sparse(2, [1], [1.0]), Vectors.dense(2.0, 3.0))],
...     ["id", "x", "y"])
>>> r1 = MLUtils.convertVectorColumnsToML(df).first()
>>> isinstance(r1.x, pyspark.ml.linalg.SparseVector)
True
>>> isinstance(r1.y, pyspark.ml.linalg.DenseVector)
True
>>> r2 = MLUtils.convertVectorColumnsToML(df, "x").first()
>>> isinstance(r2.x, pyspark.ml.linalg.SparseVector)
True
>>> isinstance(r2.y, pyspark.mllib.linalg.DenseVector)
True

static loadLabeledPoints(sc, path, minPartitions=None)

Load labeled points saved using RDD.saveAsTextFile.

New in version 1.0.0.

Parameters
sc : pyspark.SparkContext

Spark context

path : str

file or directory path in any Hadoop-supported file system URI

minPartitions : int, optional

min number of partitions

Returns
pyspark.RDD

labeled data stored as an RDD of LabeledPoint

Examples

>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.linalg import Vectors
>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, -1.23), (2, 4.56e-7)])),
...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize(examples, 1).saveAsTextFile(tempFile.name)
>>> MLUtils.loadLabeledPoints(sc, tempFile.name).collect()
[LabeledPoint(1.1, (3,[0,2],[-1.23,4.56e-07])), LabeledPoint(0.0, [1.01,2.02,3.03])]

static loadLibSVMFile(sc, path, numFeatures=-1, minPartitions=None)

Loads labeled data in the LIBSVM format into an RDD of LabeledPoint. The LIBSVM format is a text-based format used by LIBSVM and LIBLINEAR. Each line represents a labeled sparse feature vector using the following format:

label index1:value1 index2:value2 …

where the indices are one-based and in ascending order. This method parses each line into a LabeledPoint, where the feature indices are converted to zero-based.

New in version 1.0.0.

Parameters
sc : pyspark.SparkContext

Spark context

path : str

file or directory path in any Hadoop-supported file system URI

numFeatures : int, optional

number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.

minPartitions : int, optional

min number of partitions

Returns
pyspark.RDD

labeled data stored as an RDD of LabeledPoint

Examples

>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.regression import LabeledPoint
>>> tempFile = NamedTemporaryFile(delete=True)
>>> _ = tempFile.write(b"+1 1:1.0 3:2.0 5:3.0\n-1\n-1 2:4.0 4:5.0 6:6.0")
>>> tempFile.flush()
>>> examples = MLUtils.loadLibSVMFile(sc, tempFile.name).collect()
>>> tempFile.close()
>>> examples[0]
LabeledPoint(1.0, (6,[0,2,4],[1.0,2.0,3.0]))
>>> examples[1]
LabeledPoint(-1.0, (6,[],[]))
>>> examples[2]
LabeledPoint(-1.0, (6,[1,3,5],[4.0,5.0,6.0]))

static loadVectors(sc, path)

Loads vectors saved using RDD[Vector].saveAsTextFile with the default number of partitions.

New in version 1.5.0.
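
The original page gives no example for this method; a minimal round-trip sketch modeled on the loadLabeledPoints example above, assuming an active SparkContext sc:

>>> from tempfile import NamedTemporaryFile
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.util import MLUtils
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> sc.parallelize([Vectors.dense([1.0, 2.0])], 1).saveAsTextFile(tempFile.name)
>>> MLUtils.loadVectors(sc, tempFile.name).collect()
[DenseVector([1.0, 2.0])]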

static saveAsLibSVMFile(data, dir)

Save labeled data in LIBSVM format.

New in version 1.0.0.

Parameters
data : pyspark.RDD

an RDD of LabeledPoint to be saved

dir : str

directory to save the data

Examples

>>> from tempfile import NamedTemporaryFile
>>> from fileinput import input
>>> from pyspark.mllib.regression import LabeledPoint
>>> from glob import glob
>>> from pyspark.mllib.util import MLUtils
>>> from pyspark.mllib.linalg import Vectors
>>> examples = [LabeledPoint(1.1, Vectors.sparse(3, [(0, 1.23), (2, 4.56)])),
...             LabeledPoint(0.0, Vectors.dense([1.01, 2.02, 3.03]))]
>>> tempFile = NamedTemporaryFile(delete=True)
>>> tempFile.close()
>>> MLUtils.saveAsLibSVMFile(sc.parallelize(examples), tempFile.name)
>>> ''.join(sorted(input(glob(tempFile.name + "/part-0000*"))))
'0.0 1:1.01 2:2.02 3:3.03\n1.1 1:1.23 3:4.56\n'
