pyspark.RDD.saveAsTextFile#
- RDD.saveAsTextFile(path, compressionCodecClass=None)[source]#
Save this RDD as a text file, using string representations of elements.
New in version 0.7.0.
- Parameters
- path : str
path to the text file
- compressionCodecClass : str, optional
fully qualified class name of the compression codec, e.g. “org.apache.hadoop.io.compress.GzipCodec” (None by default)
Examples
>>> import os
>>> import tempfile
>>> from fileinput import input
>>> from glob import glob
>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFile1") as d1:
...     path1 = os.path.join(d1, "text_file1")
...
...     # Write a temporary text file
...     sc.parallelize(range(10)).saveAsTextFile(path1)
...
...     # Load text file as an RDD
...     ''.join(sorted(input(glob(path1 + "/part-0000*"))))
'0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n'
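In practice, output written this way is usually read back with SparkContext.textFile, which takes the output directory and reads every part file. A minimal sketch of the round trip (the prefix and directory name below are illustrative, not part of the API):

>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFileRead") as d:
...     path = os.path.join(d, "text_file_read")
...     sc.parallelize(range(10)).saveAsTextFile(path)
...
...     # textFile accepts the output directory and reads all part files
...     sorted(sc.textFile(path).collect())
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']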
Empty lines are tolerated when saving to text files.
>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFile2") as d2:
...     path2 = os.path.join(d2, "text2_file2")
...
...     # Write another temporary text file
...     sc.parallelize(['', 'foo', '', 'bar', '']).saveAsTextFile(path2)
...
...     # Load text file as an RDD
...     ''.join(sorted(input(glob(path2 + "/part-0000*"))))
'\n\n\nbar\nfoo\n'
Using compressionCodecClass
>>> from fileinput import input, hook_compressed
>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFile3") as d3:
...     path3 = os.path.join(d3, "text3")
...     codec = "org.apache.hadoop.io.compress.GzipCodec"
...
...     # Write another temporary text file with specified codec
...     sc.parallelize(['foo', 'bar']).saveAsTextFile(path3, codec)
...
...     # Load text file as an RDD
...     result = sorted(input(glob(path3 + "/part*.gz"), openhook=hook_compressed))
...     ''.join([r.decode('utf-8') if isinstance(r, bytes) else r for r in result])
'bar\nfoo\n'
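Reading the compressed output back does not require fileinput: sc.textFile goes through Hadoop's input formats, which recognize the .gz extension and decompress transparently. A minimal sketch under that assumption (the prefix and directory name below are illustrative):

>>> with tempfile.TemporaryDirectory(prefix="saveAsTextFileGz") as d4:
...     path4 = os.path.join(d4, "text_gz")
...     codec = "org.apache.hadoop.io.compress.GzipCodec"
...     sc.parallelize(['foo', 'bar']).saveAsTextFile(path4, codec)
...
...     # Hadoop recognizes the .gz extension and decompresses on read
...     sorted(sc.textFile(path4).collect())
['bar', 'foo']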