RandomRDDs#
- class pyspark.mllib.random.RandomRDDs[source]#
Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
New in version 1.1.0.
Methods
exponentialRDD(sc, mean, size[, ...])  Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
exponentialVectorRDD(sc, mean, numRows, numCols)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
gammaRDD(sc, shape, scale, size[, ...])  Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
gammaVectorRDD(sc, shape, scale, numRows, ...)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
logNormalRDD(sc, mean, std, size[, ...])  Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
logNormalVectorRDD(sc, mean, std, numRows, ...)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
normalRDD(sc, size[, numPartitions, seed])  Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
normalVectorRDD(sc, numRows, numCols[, ...])  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
poissonRDD(sc, mean, size[, numPartitions, seed])  Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
poissonVectorRDD(sc, mean, numRows, numCols)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
uniformRDD(sc, size[, numPartitions, seed])  Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
uniformVectorRDD(sc, numRows, numCols[, ...])  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
Methods Documentation
- static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or 1 / lambda, for the Exponential distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ Exp(mean).
Examples
>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> bool(abs(stats.stdev() - sqrt(mean)) < 0.5)
True
- static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or 1 / lambda, for the Exponential distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
Examples
>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.asmatrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - mean) < 0.5)
True
>>> from math import sqrt
>>> bool(abs(mat.std() - sqrt(mean)) < 0.5)
True
- static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- shape : float
Shape (> 0) parameter for the Gamma distribution.
- scale : float
Scale (> 0) parameter for the Gamma distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> bool(abs(stats.mean() - expMean) < 0.5)
True
>>> bool(abs(stats.stdev() - expStd) < 0.5)
True
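The expected moments used in the example above (expMean = shape * scale, expStd = sqrt(shape) * scale) are the analytic mean and standard deviation of the Gamma distribution. As a minimal sketch, they can be checked with plain-Python sampling via random.gammavariate, without a SparkContext (this is an illustration, not the PySpark API):

```python
import math
import random
import statistics

# Illustrative sketch (plain Python, no Spark): verify the Gamma moment
# formulas mean = shape * scale and std = sqrt(shape) * scale by sampling.
random.seed(2)
shape, scale = 1.0, 2.0
samples = [random.gammavariate(shape, scale) for _ in range(20000)]

# Sample moments should land close to the analytic values.
print(abs(statistics.mean(samples) - shape * scale) < 0.1)
print(abs(statistics.stdev(samples) - math.sqrt(shape) * scale) < 0.1)
```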
- static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- shape : float
Shape (> 0) of the Gamma distribution.
- scale : float
Scale (> 0) of the Gamma distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - expMean) < 0.1)
True
>>> bool(abs(mat.std() - expStd) < 0.1)
True
- static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean for the log Normal distribution.
- std : float
Standard deviation for the log Normal distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ log N(mean, std).
Examples
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> bool(abs(stats.mean() - expMean) < 0.5)
True
>>> bool(abs(stats.stdev() - expStd) < 0.5)
True
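The expMean and expStd formulas in the example above come from the standard log-normal moments: if X ~ N(mu, sigma^2), then exp(X) has mean exp(mu + sigma^2 / 2) and variance (exp(sigma^2) - 1) * exp(2*mu + sigma^2). A minimal plain-Python check of those formulas, sampling with random.gauss instead of Spark (an illustrative sketch, not the PySpark API):

```python
import math
import random
import statistics

# Illustrative sketch (plain Python, no Spark): check the log-normal
# moment formulas by exponentiating normal samples.
random.seed(3)
mu, sigma = 0.0, 1.0
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(20000)]

exp_mean = math.exp(mu + 0.5 * sigma * sigma)
exp_std = math.sqrt((math.exp(sigma * sigma) - 1.0) * math.exp(2.0 * mu + sigma * sigma))

# Sample moments should land close to the analytic values.
print(abs(statistics.mean(samples) - exp_mean) < 0.1)
print(abs(statistics.stdev(samples) - exp_std) < 0.3)
```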
- static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean of the log normal distribution.
- std : float
Standard deviation of the log normal distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
Examples
>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - expMean) < 0.1)
True
>>> bool(abs(mat.std() - expStd) < 0.1)
True
- static normalRDD(sc, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use
RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> bool(abs(stats.mean() - 0.0) < 0.1)
True
>>> bool(abs(stats.stdev() - 1.0) < 0.1)
True
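The shift-and-scale transform described in the method notes (mapping a standard normal sample v to mean + sigma * v) can be illustrated without Spark. The following is a plain-Python sketch of the same idea, not the PySpark API:

```python
import random
import statistics

# Illustrative sketch (plain Python, no Spark): mapping standard normal
# samples v to mean + sigma * v yields samples from N(mean, sigma^2).
random.seed(1)
mean, sigma = 5.0, 2.0
std_normal = [random.gauss(0.0, 1.0) for _ in range(10000)]
transformed = [mean + sigma * v for v in std_normal]

# Sample mean and stdev should be close to 5.0 and 2.0.
print(abs(statistics.mean(transformed) - mean) < 0.1)
print(abs(statistics.stdev(transformed) - sigma) < 0.1)
```

In Spark the same per-element transform would run inside RDD.map, so it stays distributed and lazy.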
- static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - 0.0) < 0.1)
True
>>> bool(abs(mat.std() - 1.0) < 0.1)
True
- static poissonRDD(sc, mean, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or lambda, for the Poisson distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ Pois(mean).
Examples
>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> bool(abs(stats.stdev() - sqrt(mean)) < 0.5)
True
- static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or lambda, for the Poisson distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
Examples
>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.asmatrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - mean) < 0.5)
True
>>> from math import sqrt
>>> bool(abs(mat.std() - sqrt(mean)) < 0.5)
True
- static uniformRDD(sc, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use
RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
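The transform described in the method notes (mapping u in U(0.0, 1.0) to a + (b - a) * u) can likewise be illustrated without Spark; this plain-Python sketch shows the same rescaling, not the PySpark API:

```python
import random

# Illustrative sketch (plain Python, no Spark): mapping u in U(0.0, 1.0)
# to a + (b - a) * u yields samples uniform on [a, b].
random.seed(7)
a, b = -2.0, 3.0
xs = [a + (b - a) * random.random() for _ in range(10000)]

# All samples fall in [a, b]; the sample mean is near the midpoint (a + b) / 2.
print(min(xs) >= a and max(xs) <= b)
print(abs(sum(xs) / len(xs) - (a + b) / 2) < 0.1)
```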
- static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD.
- seed : int, optional
Seed for the RNG that generates the seed for the generator in each partition.
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> bool(mat.max() <= 1.0 and mat.min() >= 0.0)
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4