textFile vs wholeTextFiles in Spark

      Explain textFile vs wholeTextFiles in Spark

      • Both are methods of org.apache.spark.SparkContext.

      textFile() :

      • def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
      • Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
      • For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each line of the file is one element.
      • textFile() is the more familiar of the two methods.
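
      The line-per-element behaviour of textFile() can be sketched as a small standalone program. This is a minimal sketch, not a definitive implementation: the object name TextFileSketch, the helper readLines, and the local[*] master are assumptions made for illustration, and the default path is the one from the example above, so it must exist for the program to run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Hypothetical helper: returns the file's lines as a local array.
  // collect() pulls everything to the driver, so use it only on small inputs.
  def readLines(sc: SparkContext, path: String): Array[String] =
    sc.textFile(path).collect()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; on a cluster the master usually comes from spark-submit.
    val conf = new SparkConf().setAppName("TextFileSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Path from the example above unless one is passed on the command line.
      val path = args.headOption.getOrElse("/home/hdadmin/wc-data.txt")
      readLines(sc, path).foreach(println) // one RDD element per line
    } finally {
      sc.stop()
    }
  }
}
```

      In the spark-shell, where sc already exists, the single call sc.textFile(path) is enough; the surrounding setup is only needed for a standalone application.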

      wholeTextFiles() :

      • def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
      • Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
      • Rather than a basic RDD, wholeTextFiles() returns a pair RDD.
      • For example, if a directory contains a few files, wholeTextFiles() creates a pair RDD
        in which the key is the file name with its full path
        and the value is the entire content of that file as a single string:
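      // One (path, content) pair per file in the directory
      val myfilerdd = sc.wholeTextFiles("/home/hdadmin/MyFiles")
      // File paths only
      val keyrdd = myfilerdd.keys
      // File contents only, one string per file
      val filerdd = myfilerdd.values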

      Output :
      Array[String] = Array(

      Array[String] =
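
      The difference between the two methods shows up most clearly when both read the same directory. The following is a minimal sketch under stated assumptions: the object name TextVsWholeSketch, the helper counts, the temporary directory, and the file names a.txt and b.txt are all hypothetical, and the program assumes Spark runs in local mode so that a file: URI is readable.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextVsWholeSketch {
  // Hypothetical helper: (number of lines, number of files) under dir.
  def counts(sc: SparkContext, dir: String): (Long, Long) = {
    val lineCount = sc.textFile(dir).count()       // one element per line, across all files
    val fileCount = sc.wholeTextFiles(dir).count() // one (path, content) pair per file
    (lineCount, fileCount)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the temporary directory is visible to Spark.
    val conf = new SparkConf().setAppName("TextVsWholeSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Scratch directory with two small files, created only for this illustration.
      val dir = Files.createTempDirectory("myfiles")
      Files.write(dir.resolve("a.txt"), "one\ntwo\n".getBytes(StandardCharsets.UTF_8))
      Files.write(dir.resolve("b.txt"), "three\n".getBytes(StandardCharsets.UTF_8))

      val (lineCount, fileCount) = counts(sc, dir.toUri.toString)
      println(s"textFile elements: $lineCount, wholeTextFiles elements: $fileCount")

      // Each pair carries the full path and the whole file content.
      sc.wholeTextFiles(dir.toUri.toString).collect().foreach { case (path, content) =>
        println(s"$path -> ${content.length} characters")
      }
    } finally {
      sc.stop()
    }
  }
}
```

      Because wholeTextFiles() loads each file as a single string, it suits many small files; for large files, textFile() is usually the safer choice since it splits the input by line.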
