pyspark.SparkContext.wholeTextFiles

SparkContext.wholeTextFiles(path: str, minPartitions: Optional[int] = None, use_unicode: bool = True) → pyspark.rdd.RDD[Tuple[str, str]]

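Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is the content of the file. The text files must be encoded as UTF-8.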

If use_unicode is False, the strings will be kept as str (encoded as UTF-8), which is faster and smaller than unicode. (Added in Spark 1.2)

For example, if you have the following files:

           hdfs://a-hdfs-path/part-00000
           hdfs://a-hdfs-path/part-00001
           ...
           hdfs://a-hdfs-path/part-nnnnn
          

Do rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"), then rdd contains:

           (a-hdfs-path/part-00000, its content)
           (a-hdfs-path/part-00001, its content)
           ...
           (a-hdfs-path/part-nnnnn, its content)
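
Because every record is a (path, content) pair, a common next step is to map over the pairs. The following is a minimal sketch of such a per-record function, not part of the PySpark API: the name summarize and the sample records are hypothetical, and plain tuples stand in for the RDD so the snippet runs without a Spark cluster.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlparse


def summarize(record: Tuple[str, str]) -> Tuple[str, int]:
    """Map one (path, content) pair to (file name, number of lines)."""
    path, content = record
    # The key may be a full URI or a bare path; keep only the last component.
    name = posixpath.basename(urlparse(path).path)
    return name, len(content.splitlines())


# Hypothetical sample data standing in for the records of the RDD.
records = [
    ("hdfs://a-hdfs-path/part-00000", "alpha\nbeta\n"),
    ("hdfs://a-hdfs-path/part-00001", "gamma\n"),
]
summaries = [summarize(r) for r in records]
print(summaries)
```

With a real RDD, the same function could be passed to rdd.map(summarize), assuming it is importable on the worker nodes.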
          

Notes

Small files are preferred, as each file will be loaded fully in memory.
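
Since each file is loaded whole, it can help to look for unexpectedly large files before reading a directory. The sketch below is one possible pre-check for a local directory only; the helper name oversized_files and the 100-byte threshold are assumptions chosen for illustration, and it does not apply to HDFS or other remote file systems.

```python
import os
import tempfile
from typing import List, Tuple


def oversized_files(directory: str, max_bytes: int) -> List[Tuple[str, int]]:
    """Return (path, size) for regular files in `directory` larger than `max_bytes`."""
    found = []
    for entry in os.scandir(directory):
        if entry.is_file():
            size = entry.stat().st_size
            if size > max_bytes:
                found.append((entry.path, size))
    return sorted(found)


# Demo on a throwaway directory with one small and one larger file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "small.txt"), "w") as f:
        f.write("x" * 10)
    with open(os.path.join(tmp, "large.txt"), "w") as f:
        f.write("x" * 1000)
    # Keep only the file names so the result does not depend on the temp path.
    flagged = [(os.path.basename(p), s) for p, s in oversized_files(tmp, max_bytes=100)]
    print(flagged)
```

A suitable threshold depends on executor memory and on how many files land in each partition, so the value used here is only a placeholder.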

Examples

           >>> dirPath = os.path.join(tempdir, "files")
           >>> os.mkdir(dirPath)
           >>> with open(os.path.join(dirPath, "1.txt"), "w") as file1:
           ...     _ = file1.write("1")
           >>> with open(os.path.join(dirPath, "2.txt"), "w") as file2:
           ...     _ = file2.write("2")
           >>> textFiles = sc.wholeTextFiles(dirPath)
           >>> sorted(textFiles.collect())
           [('.../1.txt', '1'), ('.../2.txt', '2')]
          
