pyspark.SparkContext.binaryFiles
- SparkContext.binaryFiles(path, minPartitions=None)
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
New in version 1.3.0.
- Parameters
- path : str
directory of the input data files; the path can be a comma-separated list of paths for multiple inputs (see the second example below)
- minPartitions : int, optional
suggested minimum number of partitions for the resulting RDD
- Returns
- RDD
RDD representing path-content pairs from the file(s).
Notes
Small files are preferred; large files are also allowed, but they may cause poor performance.
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())
>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
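As a sketch of the comma-separated form of path described above (the directory prefix and file names here are illustrative, not part of the original example), multiple inputs can be joined into a single string:

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFilesMulti") as d:
...     # Write two illustrative binary files
...     with open(os.path.join(d, "a.bin"), "wb") as fa:
...         _ = fa.write(b"first")
...     with open(os.path.join(d, "b.bin"), "wb") as fb:
...         _ = fb.write(b"second")
...     # Join both paths into one comma-separated input string
...     paths = ",".join([os.path.join(d, "a.bin"), os.path.join(d, "b.bin")])
...     collected = sorted(sc.binaryFiles(paths).collect())
>>> [content for _, content in collected]
[b'first', b'second']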
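Because each value in the resulting RDD is the file's raw bytes, a common next step is to transform the content. The following minimal sketch (again with an illustrative file name) maps each file to its size in bytes using RDD.mapValues:

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFilesSize") as d:
...     # Write a small illustrative binary file
...     with open(os.path.join(d, "x.bin"), "wb") as fx:
...         _ = fx.write(b"\x00\x01\x02\x03")
...     # Map each (path, bytes) pair to (path, number of bytes)
...     sizes = sc.binaryFiles(d).mapValues(len).collect()
>>> [n for _, n in sizes]
[4]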