
pyspark.SparkContext.addFile

SparkContext.addFile(path, recursive=False)

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location.
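As a hedged sketch of the remote-URI case (the URL, file name, and function below are hypothetical, not from these docs): the file is downloaded once per node, and tasks retrieve the node-local copy by its basename with SparkFiles.get().

>>> from pyspark import SparkFiles
>>> sc.addFile("https://example.com/data/lookup.txt")  # hypothetical URL
>>> def prepend_prefix(xs):
...     # SparkFiles.get() maps the basename to the node-local download path
...     with open(SparkFiles.get("lookup.txt")) as f:
...         prefix = f.readline().strip()
...     return [prefix + "-" + x for x in xs]
>>> sc.parallelize(["a", "b"]).mapPartitions(prepend_prefix).collect()  # ["<prefix>-a", "<prefix>-b"]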

A directory can be given if the recursive option is set to True. Currently, directories are only supported for Hadoop-supported filesystems.
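A minimal sketch of the recursive option, assuming a hypothetical directory /tmp/conf_dir containing settings.txt (the local filesystem also counts as Hadoop-supported); the layout of the added directory under SparkFiles.getRootDirectory() on executors is an assumption here, not something these docs specify.

>>> import os
>>> from pyspark import SparkFiles
>>> sc.addFile("/tmp/conf_dir", recursive=True)  # hypothetical directory
>>> def read_setting(_):
...     # assumption: the directory is copied under the root download
...     # directory keeping its original name
...     path = os.path.join(SparkFiles.getRootDirectory(), "conf_dir", "settings.txt")
...     with open(path) as f:
...         return [f.readline().strip()]
>>> sc.parallelize([0], 1).flatMap(read_setting).collect()  # first line of settings.txt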

New in version 0.7.0.

Parameters
path : str

can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.

recursive : bool, default False

whether to recursively add files in the input directory

Notes

A path can be added only once. Subsequent additions of the same path are ignored.

Examples

>>> import os
>>> import tempfile
>>> from pyspark import SparkFiles
>>> with tempfile.TemporaryDirectory(prefix="addFile") as d:
...     path1 = os.path.join(d, "test1.txt")
...     with open(path1, "w") as f:
...         _ = f.write("100")
...
...     path2 = os.path.join(d, "test2.txt")
...     with open(path2, "w") as f:
...         _ = f.write("200")
...
...     sc.addFile(path1)
...     file_list1 = sorted(sc.listFiles)
...
...     sc.addFile(path2)
...     file_list2 = sorted(sc.listFiles)
...
...     # add path2 twice, this addition will be ignored
...     sc.addFile(path2)
...     file_list3 = sorted(sc.listFiles)
...
...     def func(iterator):
...         with open(SparkFiles.get("test1.txt")) as f:
...             mul = int(f.readline())
...         return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()
>>> file_list1
['file:/.../test1.txt']
>>> file_list2
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> file_list3
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> collected
[100, 200, 300, 400]
