What are the file formats in Hadoop?

    • #5267
      DataFlair Team
      Spectator

      What are the file formats in Hadoop? Please explain them in detail.
      What is the use of each file format?

    • #5268
      DataFlair Team
      Spectator

      1. Text/CSV Files
      2. JSON Records
      3. Avro Files
      4. Sequence Files
      5. RC Files
      6. ORC Files
      7. Parquet Files
      Text/CSV Files: CSV is the most commonly used data file format. It is the most readable, is easy to parse everywhere, and is the usual choice when exporting data from an RDBMS table. However, human readable does not mean machine friendly. CSV has three major drawbacks when used on HDFS. First, every line in a CSV file is a record, so the file should not include headers or footers; in other words, a CSV file in HDFS carries no metadata. Second, CSV has very limited support for schema evolution: because the fields of each record are ordered, the order cannot be changed, and new fields can only be appended to the end of each line. Last, CSV does not support block compression, which many other file formats do, so the whole file has to be compressed and decompressed for reading, adding a significant read performance cost.
      Text and CSV files are quite common, and Hadoop developers and data scientists frequently receive them to work on.
      However, because CSV files do not support block compression, compressing a CSV file in Hadoop often comes at a significant read performance cost.
      That is why, if you are working with text or CSV files, you should not include a header in the file; otherwise it will give you null values while computing over the data. (The Hive tutorial explains how to deal with this.)
      Each line in these files should be a record, so no metadata is stored in these files.
      You must know how the file was written in order to make use of it, as the sketch below illustrates.
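      A minimal sketch (the file name and field layout are made up for illustration): it writes a small CSV file without a header row and reads it back with Python's standard csv module. Notice that reading only works because we already know the column order.

```python
import csv

rows = [
    ["1", "Asha", "2021-03-01"],
    ["2", "Ravi", "2021-03-02"],
]

# Write records only -- no header or footer lines, since every line in a
# CSV file on HDFS is treated as a data record.
with open("employees.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# The file carries no schema, so the reader must already know that the
# columns are (id, name, join_date) and in that exact order.
with open("employees.csv", newline="") as f:
    for emp_id, name, join_date in csv.reader(f):
        print(emp_id, name, join_date)
```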
      JSON Files: JSON is a text format that stores metadata with the data, so it fully supports schema evolution; you can easily add or remove attributes for each datum. However, because it is a text format, it does not support block compression.
      JSON records files store one JSON datum per line. Because each record carries its own field names, metadata is stored with the data and the file is splittable, but again it does not support block compression.
      The only issue is that Hadoop itself has limited built-in support for JSON files, but third-party tools (SerDes and libraries) help a lot, so experiment with them and get your work done. A small example follows.
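      As a small illustration (the field names are invented), this sketch writes and reads a JSON-records file where each line is its own JSON object, so attributes can vary from record to record:

```python
import json

records = [
    {"id": 1, "name": "Asha"},
    {"id": 2, "name": "Ravi", "dept": "Sales"},  # extra attribute: the schema can evolve per record
]

# One JSON object per line ("JSON records"), which keeps the file splittable.
with open("employees.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Each line is parsed independently; field names travel with the data.
with open("employees.jsonl") as f:
    for line in f:
        print(json.loads(line))
```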
      Avro Files: An Avro file is serialized data in binary format. It uses JSON to define the schema, and it stores data row by row. It is one of the most popular storage formats in Hadoop. Avro stores metadata with the data, and it also allows you to specify an independent schema for reading the files; therefore you can easily add, delete, or update data fields just by creating a new independent schema. Avro files are also splittable, support block compression, and enjoy a wide range of tool support within the Hadoop ecosystem.
      Avro is quickly becoming a top choice for developers because of these benefits: it stores metadata with the data itself and allows an independent schema to be specified for reading the file.
      You can rename, add, delete, and change the data types of fields by defining a new independent schema. Also, Avro files are splittable, support block compression, and enjoy broad, relatively mature tool support within the Hadoop ecosystem.
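      A minimal sketch, assuming the third-party fastavro package is installed (the schema, record values, and file name are invented for illustration). It shows the JSON-defined schema, row-wise binary serialization, and block compression described above:

```python
from fastavro import writer, reader, parse_schema

# The schema is defined in JSON; a new, independent reader schema can
# later add, remove, or rename fields (schema evolution).
schema = parse_schema({
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
})

records = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi"}]

# Rows are serialized in compact binary blocks; the schema travels in the
# file header, and the data blocks are compressed.
with open("employees.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")

with open("employees.avro", "rb") as fo:
    for rec in reader(fo):
        print(rec)
```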
      Sequence Files: Sequence files are binary files with a flat, CSV-like structure. They do not store metadata with the data, and the only schema-evolution option is appending new fields, but they do support block compression. Because they are not human readable, they are mostly used for intermediate data storage within a chain of MapReduce jobs.
      In other words, a sequence file stores key-value records in binary form: you give up metadata and readability in exchange for splittability and block compression.
      Because of this, sequence files are mainly used as intermediate storage for in-flight data between jobs, as in the sketch below.
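      A minimal sketch, assuming a local PySpark installation (the path and key-value pairs are invented; plain Java MapReduce code would use SequenceFile.Writer directly instead):

```python
from pyspark import SparkContext

sc = SparkContext("local", "sequencefile-demo")

# Sequence files hold binary key-value records and are a common format
# for intermediate output passed between jobs.
pairs = sc.parallelize([("emp1", 1000), ("emp2", 2000)])
pairs.saveAsSequenceFile("/tmp/salaries.seq")

# Reading it back: no metadata is stored, so the reader has to know
# what the keys and values mean.
print(sc.sequenceFile("/tmp/salaries.seq").collect())

sc.stop()
```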
      RC Files (Record Columnar Files): The RC file was the first columnar file format in Hadoop, and it brings significant compression and query performance benefits.
      However, it does not support schema evolution: if you want to add anything to an RC file, you have to rewrite the whole file. Writing RC files is also a slower process.
      ORC Files: ORC (Optimized Record Columnar) files are a columnar file format. They are great for compression and query performance, at the cost of more memory use and poorer write performance. ORC files are optimized RC files that work better with Hive: they compress better than RC files, enabling faster queries, but they still do not support schema evolution. It is worth noting that ORC is a format primarily backed by Hortonworks, and it is not supported by Cloudera Impala.
      In short, ORC keeps all the benefits of the RC file with some enhancements; some benchmarks indicate that ORC files compress to the smallest size of all the file formats in Hadoop. The sketch below writes and reads a small ORC file.
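      A minimal sketch, assuming a pyarrow build with ORC support is installed (the table contents and file name are invented). It stores the data column by column and then reads back only one column:

```python
import pyarrow as pa
import pyarrow.orc as orc

# A small column-oriented table; ORC stores values column by column,
# which is what gives it its strong compression.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
    "salary": [1000, 2000, 1500],
})

orc.write_table(table, "employees.orc")

# Reading only the columns a query needs avoids scanning the rest.
print(orc.read_table("employees.orc", columns=["name"]))
```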
      Parquet Files: Parquet is also a columnar format. Just like ORC, it compresses very well and gives great query performance, and it is especially efficient when querying data from specific columns. Parquet is computationally intensive on the write side, but it saves a lot of I/O on the read side, giving great read performance. It has more freedom than ORC in schema evolution, in that new columns can be added at the end of the structure. It is also backed by Cloudera and optimized for Impala.
      Since Avro and Parquet have so much in common, let's compare them a little further. When choosing a file format for HDFS, we need to consider both read performance and write performance; because the nature of HDFS is to store data that is written once and read many times, we want to emphasize read performance. The fundamental difference in how the two are used is this: Avro is a row-based format, so if you want to retrieve records as a whole, use Avro; Parquet is a column-based format, so if your data consists of many columns but you are interested in only a subset of them, use Parquet.
      Parquet is another columnar file format; it grew out of the Trevni project of Hadoop creator Doug Cutting. Like the other columnar formats, RC and ORC, Parquet enjoys compression and query performance benefits, but it is generally slower to write than non-columnar file formats.
      In the Parquet format, new columns can be added at the end of the structure. The format was originally optimized for Cloudera Impala but is rapidly gaining popularity in the rest of the ecosystem as well.
      One thing you should note here is that if you are working with Parquet files in Hive, you should take some precautions. A small example of Parquet's column-oriented reads follows.
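      A minimal sketch, assuming the third-party pyarrow package (the table contents and file name are invented). It writes a compressed Parquet file and then reads back only one column, which is exactly the access pattern Parquet is designed for:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meena"],
    "salary": [1000, 2000, 1500],
})

# Columnar layout with per-column compression on disk.
pq.write_table(table, "employees.parquet", compression="snappy")

# Only the 'salary' column is read from disk; the other columns are skipped,
# which is why Parquet suits queries over a subset of many columns.
print(pq.read_table("employees.parquet", columns=["salary"]))
```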
