What is the difference between schema on Read and Schema on Write?

Free Online Certification Courses – Learn Today. Lead Tomorrow. Forums Apache Hadoop What is the difference between schema on Read and Schema on Write?

Viewing 4 reply threads
  • Author
    Posts
    • #5816
      DataFlair TeamDataFlair Team
      Spectator

      What is the difference between schema on Read and Schema on Write?
      In Hive why schema on Read has been preferred over Schema on Write?

    • #5817
      DataFlair TeamDataFlair Team
      Spectator

      Schema-on-Write (RDBMS):

      Prescriptive Data Modeling:
      Create static DB schema
      Transform data into RDBMS
      Query data in RDBMS format
      New columns must be added explicitly before new data can propagate into the system.
      Good for Known Unknowns (Repetition)
      Schema-on-Read (Hadoop):

      Descriptive Data Modeling:
      Copy data in its native format
      Create schema + parser
      Query Data in its native format (does ETL on the fly)
      New data can start flowing any time and will appear retroactively once the schema/parser properly describes it.
      Good for Unknown Unknowns (Exploration)

    • #5819
      DataFlair TeamDataFlair Team
      Spectator

      Schema-on-Read vs. Schema-on-Write:

      In traditional databases, the table’s schema is imposed during the data load time, if the data being loaded does not conform to the schema then the data load is rejected, this process is know as Schema-on-Write. Here the data is being checked against the schema when written into the database(during data load).

      Now in HIVE, the data schema is not verified during the load time, rather it is verified while processing the query. Hence this process in HIVE called Schema-on-Read.

      Now, which way is better? Schema-on-Read or Schema-on-Write?

      Schema-on-Read:
      Schema-on-Read helps in very fast initial data load, since the data does not have to follow any internal schema(internal database format) to read or parse or serialize, as it is just a copy/move of a file.
      This type of movement of data is more flexible incase of huge data or having two schemas for same underlying data.

      Schema-on-Write:
      Schema-on-Write helps in faster performance of the query, as the data is already loaded in a particular format and it is easy to locate the column index or compress the data. However, it takes longer time to load data into the database.

      So, in scenarios of large data load or where the schema is not known at load time and there are no indexes to apply, as the query is not formulated, Schema-on-Read(property of HIVE) is more efficient than Schema-on-write.

    • #5820
      DataFlair TeamDataFlair Team
      Spectator

      Schema on write is defined as creating a schema for data before writing into the database.
      The schema; the columns, rows, tables, and relationships are all defined first for the specific purpose that database will serve

      The main advantages of schema on write are precision and query speed..

      Schema on read differs from schema on write because schema only created when reading the data. Structured is applied to the data only when it’s read, this allows unstructured data to be stored in the database. Since it’s not necessary to define the schema before storing the data it makes it easier to bring in new data sources on the fly.

      The main advantages of schema on read are flexibility in purpose and query power.

    • #5822
      DataFlair TeamDataFlair Team
      Spectator

      Schema on read Vs Schema on write.

      In traditional way(RDBMS), Before storing data, a schema is created for a data before it is written into the database. So suppose we create a Schema consisting of 10 Columns and we try to load data which could satisfy only 9 columns that data would be rejected because of Schema on write, here the data is read against schema before it is written to the database.
      The main advantages of RDBMS are that its Query response time is immediate and precision.

      With HIVE, we have Schema on read, which means the data is not verified before loading but rather when a query is issued, here we have very fast initial load as the data is not read.

Viewing 4 reply threads
  • You must be logged in to reply to this topic.