
pyspark.SparkContext.binaryFiles

SparkContext.binaryFiles(path, minPartitions=None)

Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.

New in version 1.3.0.

Parameters
path : str

directory of the input data files; the path can be a comma-separated list of paths to specify multiple inputs

minPartitions : int, optional

suggested minimum number of partitions for the resulting RDD

Returns
RDD

RDD representing path-content pairs from the file(s).

Notes

Small files are preferred; large files are also allowed, but they may cause poor performance.

Examples

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())
>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
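As noted under the path parameter, multiple inputs can be passed as a single comma-separated string. A minimal sketch under the same setup as above (the file names are illustrative, and sc is the running SparkContext):

>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write two illustrative binary files
...     with open(os.path.join(d, "a.bin"), "wb") as fa:
...         _ = fa.write(b"binary data A")
...     with open(os.path.join(d, "b.bin"), "wb") as fb:
...         _ = fb.write(b"binary data B")
...
...     # Join the individual file paths into one comma-separated string
...     paths = ",".join([os.path.join(d, "a.bin"), os.path.join(d, "b.bin")])
...     collected = sorted(sc.binaryFiles(paths).collect())
>>> collected
[('.../a.bin', b'binary data A'), ('.../b.bin', b'binary data B')]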
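The returned RDD holds raw bytes, so any decoding or parsing is left to the caller. As a small illustrative sketch (not part of this API's documented behavior), the standard RDD.mapValues transformation can be applied to the content of each file, here computing per-file sizes:

>>> with tempfile.TemporaryDirectory(prefix="binaryFiles") as d:
...     # Write an illustrative binary file
...     with open(os.path.join(d, "3.bin"), "wb") as f3:
...         _ = f3.write(b"\x01\x02\x03\x04")
...
...     # mapValues applies a function to each file's byte content,
...     # leaving the file-path keys unchanged
...     sizes = sorted(sc.binaryFiles(d).mapValues(len).collect())
>>> sizes
[('.../3.bin', 4)]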
