Flink Compatibility with Hadoop – Comprehensive Tutorial


This tutorial on Flink compatibility with Hadoop will help you understand how Apache Flink works with Big Data Hadoop.

It also covers the basics of Big Data Hadoop and Apache Flink, along with a comparison between MapReduce and Flink, which is useful preparation for Apache Flink job roles.

Before starting with Hadoop Flink compatibility, let us brush up the Flink concepts and Hadoop concepts.

Hadoop Compatibility with Flink

Apache Hadoop is widely used for scalable analytical data processing across industries. Many applications implemented in Hadoop MapReduce run successfully on clusters today.

The Big Data ecosystem has matured with Apache Flink, which provides an alternative to MapReduce with several improvements.

Even against well-optimized Hadoop MapReduce jobs, Flink generally offers better performance thanks to its pipelined execution, and it provides easy-to-use APIs in Java and Scala. Several key features differentiate Flink from Spark and Hadoop.

Flink’s APIs provide interfaces that correspond to Hadoop’s Mapper and Reducer functions, as well as to its InputFormats and OutputFormats.

But do you know:

“Though the functions are conceptually equivalent, Hadoop’s MapReduce interfaces and Flink’s interfaces for these functions are not source compatible.”

Flink Hadoop Compatibility Package

To close this Hadoop Flink compatibility gap, a compatibility package was developed as part of a Google Summer of Code 2014 project. The package wraps functions implemented against Hadoop’s MapReduce interfaces so that they can be embedded in Flink programs.

The Hadoop compatibility package allows you to reuse the following Hadoop APIs in Flink programs without any code changes:

  • InputFormats (mapred and mapreduce APIs) as Flink DataSource
  • OutputFormats (mapred and mapreduce APIs) as Flink DataSink
  • Mappers (mapred API) as FlatMap function
  • Reducers (mapred API) as GroupReduce function

Using Hadoop Data Types

Flink natively supports Hadoop data types such as Writable and WritableComparable. If you only want to use Hadoop data types, you do not need to include the Hadoop compatibility dependency.
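
For example, as a minimal sketch (the words and counts here are made-up values for illustration), Hadoop Writable types such as Text and IntWritable can be used directly as fields of a Flink DataSet:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Hadoop Writables (Text, IntWritable) used directly as Flink tuple fields; no wrapper class is needed.
DataSet<Tuple2<Text, IntWritable>> counts = env.fromElements(
        new Tuple2<>(new Text("flink"), new IntWritable(1)),
        new Tuple2<>(new Text("hadoop"), new IntWritable(2)));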

Project Configuration

Flink’s support for Hadoop Mappers and Reducers is provided by the flink-hadoop-compatibility Maven module, which is required whenever you want to reuse Hadoop code in a Flink job. The code resides in the org.apache.flink.hadoopcompatibility package.

To reuse Hadoop Mappers and Reducers, add the following dependency to your pom.xml:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-hadoop-compatibility_2.10</artifactId>
  <version>1.1.3</version>
</dependency>

Using Hadoop InputFormats

To create a Hadoop InputFormat as a data source in Flink, use readHadoopFile (for input formats derived from FileInputFormat) or createHadoopInput (for general-purpose input formats) on the execution environment; a sketch of the latter follows the TextInputFormat example below.

The resulting DataSet contains 2-tuples of the key and value retrieved from the Hadoop InputFormat.
The example below shows how to use Hadoop’s TextInputFormat:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Read the file as (byte offset, line) 2-tuples using Hadoop's TextInputFormat (mapred API).
DataSet<Tuple2<LongWritable, Text>> input = env.readHadoopFile(
        new TextInputFormat(), LongWritable.class, Text.class, textPath);
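
For input formats that you configure through a JobConf yourself, createHadoopInput works the same way. A minimal sketch using the mapred-API variant, reusing the env from above; the SequenceFileInputFormat, the (Text, LongWritable) key/value types, and seqFilePath are illustrative assumptions:

JobConf jobConf = new JobConf();
org.apache.hadoop.mapred.FileInputFormat.addInputPath(jobConf, new Path(seqFilePath));

// Wrap a general-purpose Hadoop InputFormat (mapred API) as a Flink data source.
DataSet<Tuple2<Text, LongWritable>> seqInput = env.createHadoopInput(
        new org.apache.hadoop.mapred.SequenceFileInputFormat<Text, LongWritable>(),
        Text.class, LongWritable.class, jobConf);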

Using Hadoop OutputFormats

Flink provides a compatibility wrapper for Hadoop OutputFormats. It supports any class that implements org.apache.hadoop.mapred.OutputFormat or extends org.apache.hadoop.mapreduce.OutputFormat. The wrapper expects its input to be a DataSet of 2-tuples of key and value, which the Hadoop OutputFormat then processes.
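
A minimal sketch of emitting a DataSet through Hadoop’s TextOutputFormat of the mapreduce API, wrapped in Flink’s org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat; the placeholder result data and outputPath are assumptions for illustration:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// Placeholder result; in a real job this DataSet would come from earlier transformations.
DataSet<Tuple2<Text, IntWritable>> hadoopResult = env.fromElements(
        new Tuple2<>(new Text("flink"), new IntWritable(1)));

// Wrap Hadoop's TextOutputFormat (mapreduce API) in Flink's compatibility wrapper.
Job job = Job.getInstance();
HadoopOutputFormat<Text, IntWritable> hadoopOF =
        new HadoopOutputFormat<>(new TextOutputFormat<Text, IntWritable>(), job);
TextOutputFormat.setOutputPath(job, new Path(outputPath));

// Emit the (key, value) 2-tuples through the wrapped Hadoop OutputFormat.
hadoopResult.output(hadoopOF);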

Using Hadoop Mappers and Reducers

Flink’s FlatMap and GroupReduce functions are the interface equivalents of Hadoop’s Mappers and Reducers, respectively. You can use implementations of the Mapper and Reducer interfaces of Hadoop’s mapred API as they are in Flink.

Flink’s function wrappers, which we can use as regular Flink FlatMapFunctions or GroupReduceFunctions (a sketch follows this list), are:

  • org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction,
  • org.apache.flink.hadoopcompatibility.mapred.HadoopReduceFunction, and
  • org.apache.flink.hadoopcompatibility.mapred.HadoopReduceCombineFunction.
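
A minimal sketch of plugging a Hadoop Mapper and Reducer into a Flink program; Tokenizer and Counter are hypothetical classes implementing the Mapper and Reducer interfaces of Hadoop’s mapred API (for example, taken from an existing WordCount job), and input is the (LongWritable, Text) DataSet created in the InputFormat example above:

DataSet<Tuple2<Text, LongWritable>> result = input
        // Use a Hadoop Mapper (the hypothetical Tokenizer) as a Flink FlatMapFunction.
        .flatMap(new HadoopMapFunction<LongWritable, Text, Text, LongWritable>(
                new Tokenizer()))
        .groupBy(0)
        // Use a Hadoop Reducer (the hypothetical Counter) as GroupReduce and Combine function.
        .reduceGroup(new HadoopReduceCombineFunction<Text, LongWritable, Text, LongWritable>(
                new Counter(), new Counter()));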

How to use Hadoop Functions in a Flink Program

You can use Hadoop functions at any position within a Flink program and mix them with native Flink functions.

This means you can implement an arbitrarily complex Flink program consisting of multiple Hadoop InputFormats and OutputFormats and Mapper and Reducer functions, without assembling a workflow of Hadoop jobs in an external driver method or using a workflow scheduler like Apache Oozie.
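
Putting the pieces together, here is a sketch of a complete WordCount that mixes Hadoop and native Flink code: a Hadoop InputFormat as the source, native Flink functions in the middle, and a Hadoop OutputFormat of the mapred API as the sink. The class name and the input and output paths are assumptions for illustration:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapred.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class MixedWordCount {

    public static void main(String[] args) throws Exception {
        final String textPath = args[0];    // input path, e.g. an HDFS directory
        final String outputPath = args[1];  // output path

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1. Hadoop InputFormat as the Flink data source.
        DataSet<Tuple2<LongWritable, Text>> text = env.readHadoopFile(
                new TextInputFormat(), LongWritable.class, Text.class, textPath);

        // 2. Native Flink functions in the middle of the program.
        DataSet<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<Tuple2<LongWritable, Text>, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(Tuple2<LongWritable, Text> line,
                                        Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.f1.toString().toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(new Tuple2<>(word, 1));
                            }
                        }
                    }
                })
                .groupBy(0)
                .sum(1);

        // 3. Convert back to Hadoop Writable types for the Hadoop OutputFormat.
        DataSet<Tuple2<Text, IntWritable>> hadoopCounts = counts
                .map(new MapFunction<Tuple2<String, Integer>, Tuple2<Text, IntWritable>>() {
                    @Override
                    public Tuple2<Text, IntWritable> map(Tuple2<String, Integer> count) {
                        return new Tuple2<>(new Text(count.f0), new IntWritable(count.f1));
                    }
                });

        // 4. Hadoop OutputFormat (mapred API) as the Flink data sink.
        JobConf jobConf = new JobConf();
        HadoopOutputFormat<Text, IntWritable> hadoopOF =
                new HadoopOutputFormat<>(new TextOutputFormat<Text, IntWritable>(), jobConf);
        FileOutputFormat.setOutputPath(jobConf, new Path(outputPath));
        hadoopCounts.output(hadoopOF);

        env.execute("Hadoop-on-Flink WordCount");
    }
}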

Conclusion – Hadoop Flink compatibility

Hence, in this Flink compatibility with Hadoop tutorial, we saw that Flink lets us reuse code written for Hadoop MapReduce, including all data types, all InputFormats and OutputFormats, and the Mappers and Reducers of the mapred API.

Also, we can use Hadoop functions within Flink programs and mix them with all other Flink functions. Moreover, Flink’s pipelined execution allows us to assemble Hadoop functions arbitrarily without exchanging data via HDFS.

So, this was all about Hadoop Flink compatibility. Still, if you have any query, comment below.
