textFile vs wholeTextFiles in Spark

    • #5183
      DataFlair Team

      Explain textFile vs wholeTextFiles in Spark.

    • #5186
      DataFlair Team

      • Both are methods of org.apache.spark.SparkContext.


      textFile():

      • def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
      • Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.
      • For example, sc.textFile("/home/hdadmin/wc-data.txt") creates an RDD in which each line of the file is one element.
      • textFile is the method most people already use to read text input.
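
      To make the "one element per line" behaviour concrete, here is a minimal sketch. It assumes a local Spark session (in spark-shell, `sc` already exists) and writes a hypothetical three-line sample file to a temp location; the object name, file name and contents are illustrative only and are not from the original post.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object TextFileSketch {
  // Reads a small temp file with textFile and returns its lines.
  def run(): Array[String] = {
    // Assumption: local master, for illustration only.
    val spark = SparkSession.builder()
      .appName("textFile-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // Hypothetical sample file with three lines.
      val file = Files.createTempFile("wc-data", ".txt")
      Files.write(file, "first line\nsecond line\nthird line\n".getBytes("UTF-8"))

      // RDD[String]: each line of the file becomes one element.
      val lines = sc.textFile(file.toUri.toString)
      lines.collect()
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = run()
    lines.foreach(println)
    println(s"number of elements = ${lines.length}")
  }
}
```

      The path is passed as a `file:` URI so the example does not depend on the cluster's default file system.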


      wholeTextFiles():

      • def wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]
      • Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
      • Rather than a basic RDD, wholeTextFiles() returns a pair RDD.
      • For example, if you have a few files in a directory, wholeTextFiles() creates a pair RDD
        in which the key is the file name with its full path
        and the value is the entire content of that file as a single string:
      val myfilerdd = sc.wholeTextFiles("/home/hdadmin/MyFiles")
      // keys: the full path of each file
      val keyrdd = myfilerdd.keys
      keyrdd.collect
      // values: the entire content of each file
      val filerdd = myfilerdd.values
      filerdd.collect


      Output:
      Array[String] = Array(
      file:/home/hdadmin/MyFiles/JavaSparkPi.java,
      file:/home/hdadmin/MyFiles/sumnumber.txt,
      file:/home/hdadmin/MyFiles/JavaHdfsLR.java,
      file:/home/hdadmin/MyFiles/JavaPageRank.java,
      file:/home/hdadmin/MyFiles/JavaLogQuery.java,
      file:/home/hdadmin/MyFiles/wc-data.txt,
      file:/home/hdadmin/MyFiles/nosum.txt)

      Array[String] =
      Array("/*
      * Licensed to the Apache Software Foundation (ASF) under one or more
      * contributor license agreements. See the NOTICE file distributed with
      * this work for additional information regarding copyright ownership.
      * The ASF licenses this file to You under the Apache License, Version 2.0
      * (the "License"); you may not use this file except in compliance with
      * the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

      * Unless required by applicable law or agreed to in writing, software
      * distributed under the License is distributed on an "AS IS" BASIS,
      * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      * See the License for the specific language governing permissions and
      * …
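
      To contrast the two methods side by side, the sketch below reads the same directory with both. It assumes a local Spark session and a temp directory with two hypothetical files (a.txt with 2 lines, b.txt with 3 lines); these names and contents are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object TextVsWholeSketch {
  // Returns (element count from textFile, element count from wholeTextFiles).
  def run(): (Long, Long) = {
    // Assumption: local master, for illustration only.
    val spark = SparkSession.builder()
      .appName("text-vs-whole-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      // Hypothetical directory with two small files.
      val dir = Files.createTempDirectory("MyFiles")
      Files.write(dir.resolve("a.txt"), "a1\na2\n".getBytes("UTF-8"))
      Files.write(dir.resolve("b.txt"), "b1\nb2\nb3\n".getBytes("UTF-8"))
      val path = dir.toUri.toString

      // RDD[String]: one element per line, across all files in the directory.
      val byLine = sc.textFile(path)
      // RDD[(String, String)]: one (path, content) pair per file.
      val byFile = sc.wholeTextFiles(path)

      // Because each value holds a whole file, per-file processing is easy,
      // for example counting the lines of each file.
      byFile.mapValues(_.split("\n").length)
        .collect()
        .foreach { case (name, n) => println(s"$name -> $n lines") }

      (byLine.count(), byFile.count())
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (lineElements, fileElements) = run()
    println(s"textFile elements = $lineElements, wholeTextFiles elements = $fileElements")
  }
}
```

      With textFile the file boundaries are lost, while wholeTextFiles keeps each file together with its path. Since every file is loaded as a single string, wholeTextFiles is better suited to many small files than to a few very large ones.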

       
