pyspark.SparkContext.textFile
- SparkContext.textFile(name, minPartitions=None, use_unicode=True)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. The text files must be encoded as UTF-8.
New in version 0.7.0.
- Parameters
- name : str
directory of the input data files; the path can be a comma-separated list of paths to use as multiple inputs
- minPartitions : int, optional
suggested minimum number of partitions for the resulting RDD; see the partitioning sketch after the examples
- use_unicode : bool, default True
If use_unicode is False, the strings will be kept as str (encoded as UTF-8), which is faster and smaller than unicode; see the bytes sketch after the examples.
New in version 1.2.0.
- Returns
- RDD
RDD representing text data from the file(s).
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="textFile") as d:
...     path1 = os.path.join(d, "text1")
...     path2 = os.path.join(d, "text2")
...
...     # Write a temporary text file
...     sc.parallelize(["x", "y", "z"]).saveAsTextFile(path1)
...
...     # Write another temporary text file
...     sc.parallelize(["aa", "bb", "cc"]).saveAsTextFile(path2)
...
...     # Load text file
...     collected1 = sorted(sc.textFile(path1, 3).collect())
...     collected2 = sorted(sc.textFile(path2, 4).collect())
...
...     # Load two text files together
...     collected3 = sorted(sc.textFile('{},{}'.format(path1, path2), 5).collect())
>>> collected1
['x', 'y', 'z']
>>> collected2
['aa', 'bb', 'cc']
>>> collected3
['aa', 'bb', 'cc', 'x', 'y', 'z']
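Because minPartitions is only a suggested minimum, the actual partition count is decided by the Hadoop input-split computation and is often higher than the default. A minimal sketch of inspecting this with RDD.getNumPartitions(), assuming the same running sc as above (the temporary directory and the hint value of 10 are illustrative):

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="textFileParts") as d:
...     path = os.path.join(d, "numbers")
...     sc.parallelize(range(100), 2).map(str).saveAsTextFile(path)
...
...     # minPartitions is a hint, not a guarantee: the final count
...     # depends on how Hadoop computes input splits for the files
...     default_parts = sc.textFile(path).getNumPartitions()
...     hinted_parts = sc.textFile(path, minPartitions=10).getNumPartitions()
>>> hinted_parts >= default_parts
True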
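A sketch of the use_unicode=False path, again assuming a running sc. The str wording in the parameter description dates from Python 2; on Python 3 the undecoded records generally come back as bytes, so the expected output below is illustrative rather than guaranteed across versions:

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory(prefix="textFileBytes") as d:
...     path = os.path.join(d, "raw")
...     sc.parallelize(["x", "y"]).saveAsTextFile(path)
...
...     # Skip UTF-8 decoding: each record is the raw encoded line
...     raw = sorted(sc.textFile(path, use_unicode=False).collect())
>>> raw
[b'x', b'y']

Skipping the decode step saves time and memory when records are passed straight to a byte-oriented sink, at the cost of working with undecoded data downstream.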