RandomRDDs#
- class pyspark.mllib.random.RandomRDDs[source]#
Generator methods for creating RDDs comprised of i.i.d. samples from some distribution.
New in version 1.1.0.
Methods
exponentialRDD(sc, mean, size[, ...])  Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
exponentialVectorRDD(sc, mean, numRows, numCols)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
gammaRDD(sc, shape, scale, size[, ...])  Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
gammaVectorRDD(sc, shape, scale, numRows, ...)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
logNormalRDD(sc, mean, std, size[, ...])  Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
logNormalVectorRDD(sc, mean, std, numRows, ...)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
normalRDD(sc, size[, numPartitions, seed])  Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
normalVectorRDD(sc, numRows, numCols[, ...])  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
poissonRDD(sc, mean, size[, numPartitions, seed])  Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
poissonVectorRDD(sc, mean, numRows, numCols)  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
uniformRDD(sc, size[, numPartitions, seed])  Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
uniformVectorRDD(sc, numRows, numCols[, ...])  Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
Methods Documentation
- static exponentialRDD(sc, mean, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or 1 / lambda, for the Exponential distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ Exp(mean).
Examples
>>> mean = 2.0
>>> x = RandomRDDs.exponentialRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> bool(abs(stats.stdev() - sqrt(mean)) < 0.5)
True
- static exponentialVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Exponential distribution with the input mean.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or 1 / lambda, for the Exponential distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ Exp(mean).
Examples
>>> import numpy as np
>>> mean = 0.5
>>> rdd = RandomRDDs.exponentialVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.asmatrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - mean) < 0.5)
True
>>> from math import sqrt
>>> bool(abs(mat.std() - sqrt(mean)) < 0.5)
True
- static gammaRDD(sc, shape, scale, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the Gamma distribution with the input shape and scale.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- shape : float
Shape (> 0) parameter for the Gamma distribution.
- scale : float
Scale (> 0) parameter for the Gamma distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> x = RandomRDDs.gammaRDD(sc, shape, scale, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> bool(abs(stats.mean() - expMean) < 0.5)
True
>>> bool(abs(stats.stdev() - expStd) < 0.5)
True
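The expected moments used in the example above (expMean = shape * scale, expStd = sqrt(shape) * scale) are the analytic mean and standard deviation of the Gamma distribution. As a minimal sketch, they can be checked with plain-Python sampling via random.gammavariate, without a SparkContext (this is an illustration, not the PySpark API):

```python
import math
import random
import statistics

# Illustrative sketch (plain Python, no Spark): verify the Gamma moment
# formulas mean = shape * scale and std = sqrt(shape) * scale by sampling.
random.seed(2)
shape, scale = 1.0, 2.0
samples = [random.gammavariate(shape, scale) for _ in range(20000)]

# Sample moments should land close to the analytic values.
print(abs(statistics.mean(samples) - shape * scale) < 0.1)
print(abs(statistics.stdev(samples) - math.sqrt(shape) * scale) < 0.1)
```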
- static gammaVectorRDD(sc, shape, scale, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Gamma distribution.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- shape : float
Shape (> 0) of the Gamma distribution.
- scale : float
Scale (> 0) of the Gamma distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ Gamma(shape, scale).
Examples
>>> import numpy as np
>>> from math import sqrt
>>> shape = 1.0
>>> scale = 2.0
>>> expMean = shape * scale
>>> expStd = sqrt(shape * scale * scale)
>>> mat = np.matrix(RandomRDDs.gammaVectorRDD(sc, shape, scale, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - expMean) < 0.1)
True
>>> bool(abs(mat.std() - expStd) < 0.1)
True
- static logNormalRDD(sc, mean, std, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the log normal distribution with the input mean and standard deviation.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean for the log Normal distribution.
- std : float
Standard deviation for the log Normal distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ log N(mean, std).
Examples
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> x = RandomRDDs.logNormalRDD(sc, mean, std, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> bool(abs(stats.mean() - expMean) < 0.5)
True
>>> bool(abs(stats.stdev() - expStd) < 0.5)
True
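The expMean and expStd formulas in the example above come from the standard log-normal moments: if X ~ N(mu, sigma^2), then exp(X) has mean exp(mu + sigma^2 / 2) and variance (exp(sigma^2) - 1) * exp(2*mu + sigma^2). A minimal plain-Python check of those formulas, sampling with random.gauss instead of Spark (an illustrative sketch, not the PySpark API):

```python
import math
import random
import statistics

# Illustrative sketch (plain Python, no Spark): check the log-normal
# moment formulas by exponentiating normal samples.
random.seed(3)
mu, sigma = 0.0, 1.0
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(20000)]

exp_mean = math.exp(mu + 0.5 * sigma * sigma)
exp_std = math.sqrt((math.exp(sigma * sigma) - 1.0) * math.exp(2.0 * mu + sigma * sigma))

# Sample moments should land close to the analytic values.
print(abs(statistics.mean(samples) - exp_mean) < 0.1)
print(abs(statistics.stdev(samples) - exp_std) < 0.3)
```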
- static logNormalVectorRDD(sc, mean, std, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the log normal distribution.
New in version 1.3.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean of the log normal distribution.
- std : float
Standard deviation of the log normal distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ log N(mean, std).
Examples
>>> import numpy as np
>>> from math import sqrt, exp
>>> mean = 0.0
>>> std = 1.0
>>> expMean = exp(mean + 0.5 * std * std)
>>> expStd = sqrt((exp(std * std) - 1.0) * exp(2.0 * mean + std * std))
>>> m = RandomRDDs.logNormalVectorRDD(sc, mean, std, 100, 100, seed=1).collect()
>>> mat = np.matrix(m)
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - expMean) < 0.1)
True
>>> bool(abs(mat.std() - expStd) < 0.1)
True
- static normalRDD(sc, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the standard normal distribution.
To transform the distribution in the generated RDD from standard normal to some other normal N(mean, sigma^2), use
RandomRDDs.normalRDD(sc, n, p, seed).map(lambda v: mean + sigma * v)
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> x = RandomRDDs.normalRDD(sc, 1000, seed=1)
>>> stats = x.stats()
>>> stats.count()
1000
>>> bool(abs(stats.mean() - 0.0) < 0.1)
True
>>> bool(abs(stats.stdev() - 1.0) < 0.1)
True
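The shift-and-scale transform described in the method notes (mapping a standard normal sample v to mean + sigma * v) can be illustrated without Spark. The following is a plain-Python sketch of the same idea, not the PySpark API:

```python
import random
import statistics

# Illustrative sketch (plain Python, no Spark): mapping standard normal
# samples v to mean + sigma * v yields samples from N(mean, sigma^2).
random.seed(1)
mean, sigma = 5.0, 2.0
std_normal = [random.gauss(0.0, 1.0) for _ in range(10000)]
transformed = [mean + sigma * v for v in std_normal]

# Sample mean and stdev should be close to 5.0 and 2.0.
print(abs(statistics.mean(transformed) - mean) < 0.1)
print(abs(statistics.stdev(transformed) - sigma) < 0.1)
```

In Spark the same per-element transform would run inside RDD.map, so it stays distributed and lazy.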
- static normalVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the standard normal distribution.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ N(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.normalVectorRDD(sc, 100, 100, seed=1).collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - 0.0) < 0.1)
True
>>> bool(abs(mat.std() - 1.0) < 0.1)
True
- static poissonRDD(sc, mean, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or lambda, for the Poisson distribution.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ Pois(mean).
Examples
>>> mean = 100.0
>>> x = RandomRDDs.poissonRDD(sc, mean, 1000, seed=2)
>>> stats = x.stats()
>>> stats.count()
1000
>>> abs(stats.mean() - mean) < 0.5
True
>>> from math import sqrt
>>> bool(abs(stats.stdev() - sqrt(mean)) < 0.5)
True
- static poissonVectorRDD(sc, mean, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the Poisson distribution with the input mean.
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- mean : float
Mean, or lambda, for the Poisson distribution.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ Pois(mean).
Examples
>>> import numpy as np
>>> mean = 100.0
>>> rdd = RandomRDDs.poissonVectorRDD(sc, mean, 100, 100, seed=1)
>>> mat = np.asmatrix(rdd.collect())
>>> mat.shape
(100, 100)
>>> bool(abs(mat.mean() - mean) < 0.5)
True
>>> from math import sqrt
>>> bool(abs(mat.std() - sqrt(mean)) < 0.5)
True
- static uniformRDD(sc, size, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of i.i.d. samples from the uniform distribution U(0.0, 1.0).
To transform the distribution in the generated RDD from U(0.0, 1.0) to U(a, b), use
RandomRDDs.uniformRDD(sc, n, p, seed).map(lambda v: a + (b - a) * v)
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- size : int
Size of the RDD.
- numPartitions : int, optional
Number of partitions in the RDD (default: sc.defaultParallelism).
- seed : int, optional
Random seed (default: a random long integer).
- Returns
pyspark.RDD of float comprised of i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> x = RandomRDDs.uniformRDD(sc, 100).collect()
>>> len(x)
100
>>> max(x) <= 1.0 and min(x) >= 0.0
True
>>> RandomRDDs.uniformRDD(sc, 100, 4).getNumPartitions()
4
>>> parts = RandomRDDs.uniformRDD(sc, 100, seed=4).getNumPartitions()
>>> parts == sc.defaultParallelism
True
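The transform described in the method notes (mapping u in U(0.0, 1.0) to a + (b - a) * u) can likewise be illustrated without Spark; this plain-Python sketch shows the same rescaling, not the PySpark API:

```python
import random

# Illustrative sketch (plain Python, no Spark): mapping u in U(0.0, 1.0)
# to a + (b - a) * u yields samples uniform on [a, b].
random.seed(7)
a, b = -2.0, 3.0
xs = [a + (b - a) * random.random() for _ in range(10000)]

# All samples fall in [a, b]; the sample mean is near the midpoint (a + b) / 2.
print(min(xs) >= a and max(xs) <= b)
print(abs(sum(xs) / len(xs) - (a + b) / 2) < 0.1)
```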
- static uniformVectorRDD(sc, numRows, numCols, numPartitions=None, seed=None)[source]#
Generates an RDD comprised of vectors containing i.i.d. samples drawn from the uniform distribution U(0.0, 1.0).
New in version 1.1.0.
- Parameters
- sc : pyspark.SparkContext
SparkContext used to create the RDD.
- numRows : int
Number of Vectors in the RDD.
- numCols : int
Number of elements in each Vector.
- numPartitions : int, optional
Number of partitions in the RDD.
- seed : int, optional
Seed for the RNG that generates the seed for the generator in each partition.
- Returns
pyspark.RDD of Vector with vectors containing i.i.d. samples ~ U(0.0, 1.0).
Examples
>>> import numpy as np
>>> mat = np.matrix(RandomRDDs.uniformVectorRDD(sc, 10, 10).collect())
>>> mat.shape
(10, 10)
>>> bool(mat.max() <= 1.0 and mat.min() >= 0.0)
True
>>> RandomRDDs.uniformVectorRDD(sc, 10, 10, 4).getNumPartitions()
4