pyspark.SparkContext.binaryFiles
- SparkContext.binaryFiles(path, minPartitions=None)
Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
New in version 1.3.0.
- Parameters
- path : str
directory of the input data files; the path can be a comma-separated list of paths for multiple inputs (see the second example below)
- minPartitions : int, optional
suggested minimum number of partitions for the resulting RDD
- Returns
- RDD
RDD representing path-content pairs from the file(s).
Notes
Small files are preferred; large files are also allowed, but they may cause poor performance.
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())
>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
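As a sketch of the comma-separated form of path described above (the directory prefix and file names here are illustrative, not part of the original example), multiple inputs can be joined into a single string:

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFilesMulti") as d:
...     # Write two illustrative binary files
...     with open(os.path.join(d, "a.bin"), "wb") as fa:
...         _ = fa.write(b"first")
...     with open(os.path.join(d, "b.bin"), "wb") as fb:
...         _ = fb.write(b"second")
...     # Join both paths into one comma-separated input string
...     paths = ",".join([os.path.join(d, "a.bin"), os.path.join(d, "b.bin")])
...     collected = sorted(sc.binaryFiles(paths).collect())
>>> [content for _, content in collected]
[b'first', b'second']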
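Because each value in the resulting RDD is the file's raw bytes, a common next step is to transform the content. The following minimal sketch (again with an illustrative file name) maps each file to its size in bytes using RDD.mapValues:

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFilesSize") as d:
...     # Write a small illustrative binary file
...     with open(os.path.join(d, "x.bin"), "wb") as fx:
...         _ = fx.write(b"\x00\x01\x02\x03")
...     # Map each (path, bytes) pair to (path, number of bytes)
...     sizes = sc.binaryFiles(d).mapValues(len).collect()
>>> [n for _, n in sizes]
[4]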