- 1. Objective of Flink compatibility with Hadoop tutorial
- 2. Hadoop Compatibility with Flink
- 3. Flink Hadoop Compatibility Package
- 4. Using Hadoop Data Types
- 5. Project Configuration
- 6. Using Hadoop InputFormats
- 7. Using Hadoop OutputFormats
- 8. Using Hadoop Mappers and Reducers
- 9. How to use Hadoop functions in Flink program
- 10. Summary of Hadoop Flink compatibility
1. Objective of Flink compatibility with Hadoop tutorial
This tutorial explains how Apache Flink is compatible with Apache Hadoop. It also touches on how Hadoop MapReduce and Flink compare, and on which Hadoop components can be reused in Flink programs.
2. Hadoop Compatibility with Flink
Apache Hadoop is widely used for scalable analytical data processing across industries, and many applications implemented in Hadoop MapReduce run successfully on clusters. Apache Flink offers an alternative to MapReduce with several improvements. For many workloads Flink can outperform even optimized Hadoop MapReduce jobs, and it offers easy-to-use APIs in Java and Scala. Several features differentiate Flink from Spark and Hadoop. Like MapReduce, Flink’s APIs provide interfaces for Mapper and Reducer functions and for InputFormats and OutputFormats, along with many more operators.
But do you know:
“Though Hadoop MapReduce and Flink are conceptually equivalent, Hadoop’s MapReduce and Flink’s interfaces for these functions are not source compatible.”
3. Flink Hadoop Compatibility Package
To close this gap, a Hadoop compatibility package was developed as part of a Google Summer of Code 2014 project. The package wraps functions implemented against the MapReduce interfaces so that they can be embedded in Flink programs. It allows you to reuse the following Hadoop APIs in Flink programs without changing their code:
- InputFormats (mapred and mapreduce APIs) as Flink DataSource
- OutputFormats (mapred and mapreduce APIs) as Flink DataSink
- Mappers (mapred API) as FlatMap function
- Reducers (mapred API) as GroupReduce function
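The wrapping idea can be sketched in plain Java. The interfaces below are simplified, hypothetical stand-ins, not the real Hadoop or Flink classes. They only model the difference in shape: a mapred-style Mapper receives key and value as separate arguments, while a flat-map function receives a single record. The adapter bridges the two without touching the mapper's code.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MapperWrapperSketch {

    // Hypothetical stand-in for Hadoop's mapred OutputCollector.
    interface OutputCollector<K, V> {
        void collect(K key, V value);
    }

    // Hypothetical stand-in for a mapred-style Mapper: key and value are separate arguments.
    interface HadoopStyleMapper<K1, V1, K2, V2> {
        void map(K1 key, V1 value, OutputCollector<K2, V2> out);
    }

    // Hypothetical stand-in for a flat-map function: one record in, zero or more records out.
    interface FlatMapStyleFunction<I, O> {
        void flatMap(I record, List<O> out);
    }

    // Adapter: unpacks the key/value pair, calls the unchanged mapper,
    // and re-packs every emitted pair as an output record.
    static <K1, V1, K2, V2> FlatMapStyleFunction<Map.Entry<K1, V1>, Map.Entry<K2, V2>>
            wrap(HadoopStyleMapper<K1, V1, K2, V2> mapper) {
        return (record, out) ->
            mapper.map(record.getKey(), record.getValue(),
                       (k, v) -> out.add(new SimpleEntry<>(k, v)));
    }

    // Runs a tokenizing mapper through the adapter and returns the emitted (word, 1) pairs.
    static List<Map.Entry<String, Integer>> tokenize(long offset, String line) {
        HadoopStyleMapper<Long, String, String, Integer> tokenizer = (key, value, collector) -> {
            for (String word : value.split("\\s+")) {
                if (!word.isEmpty()) {
                    collector.collect(word, 1);
                }
            }
        };
        Map.Entry<Long, String> record = new SimpleEntry<>(Long.valueOf(offset), line);
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        wrap(tokenizer).flatMap(record, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(0L, "to be or not"));
    }
}
```

The mapper itself never learns that it runs behind a flat-map interface, which is the property that lets existing MapReduce code be reused unchanged.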
4. Using Hadoop Data Types
All Hadoop data types, such as Writable and WritableComparable implementations, are natively supported by Flink. If you only use Hadoop data types, you do not need to include the Hadoop compatibility dependency.
5. Project Configuration
Support for Hadoop Mappers and Reducers is provided by the flink-hadoop-compatibility Maven module, which must be added to the project explicitly. The code resides in the org.apache.flink.hadoopcompatibility package.
To reuse Mappers and Reducers, add the flink-hadoop-compatibility dependency to your pom.xml.
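The dependency entry might look like the sketch below. The group ID and artifact name follow the module named above. The flink.version property is a placeholder that you would define in your own pom.xml, and some Flink releases append a Scala version suffix to the artifact ID, so check the exact coordinates for the release you use.

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <!-- Some releases use a Scala suffix on this artifact ID; verify for your version. -->
  <artifactId>flink-hadoop-compatibility</artifactId>
  <!-- Placeholder property (assumption): set it to the Flink version of your project. -->
  <version>${flink.version}</version>
</dependency>
```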
6. Using Hadoop InputFormats
The readHadoopFile (for input formats derived from FileInputFormat) and createHadoopInput (for general-purpose input formats) methods of the ExecutionEnvironment create a Flink data source from a Hadoop InputFormat. The resulting DataSet contains 2-tuples, where the first field is the key and the second field is the value retrieved from the Hadoop InputFormat.
The following example uses Hadoop’s TextInputFormat:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Keys are byte offsets (LongWritable) and values are lines (Text); textPath is the input path.
DataSet<Tuple2<LongWritable, Text>> input = env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
7. Using Hadoop OutputFormats
Flink provides a compatibility wrapper for Hadoop OutputFormats. Any class that implements org.apache.hadoop.mapred.OutputFormat or extends org.apache.hadoop.mapreduce.OutputFormat is supported. The OutputFormat wrapper expects its input to be a DataSet of 2-tuples of key and value, which are then processed by the Hadoop OutputFormat.
8. Using Hadoop Mappers and Reducers
Hadoop Mappers are semantically equivalent to Flink’s FlatMapFunctions, and Hadoop Reducers are equivalent to Flink’s GroupReduceFunctions. Flink provides wrappers for implementations of the Mapper and Reducer interfaces of Hadoop’s mapred API, so you can use them unchanged in Flink programs.
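The following function wrappers can be used as regular Flink FlatMapFunctions or GroupReduceFunctions:
- org.apache.flink.hadoopcompatibility.mapred.HadoopMapFunction,
- org.apache.flink.hadoopcompatibility.mapred.HadoopReduceFunction, and
- org.apache.flink.hadoopcompatibility.mapred.HadoopReduceCombineFunction.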
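How a Reducer can sit behind a group-reduce interface can be sketched in plain Java. As before, the interfaces are hypothetical stand-ins and not the real Hadoop or Flink classes, and the grouping loop only imitates a group-by-key step. The sketch buffers the values of each group for simplicity.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReducerWrapperSketch {

    // Hypothetical stand-in for Hadoop's mapred OutputCollector.
    interface OutputCollector<K, V> {
        void collect(K key, V value);
    }

    // Hypothetical stand-in for a mapred-style Reducer: one key plus an iterator over its values.
    interface HadoopStyleReducer<K2, V2, K3, V3> {
        void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> out);
    }

    // Hypothetical stand-in for a group-reduce function: all records of one group in, any number out.
    interface GroupReduceStyleFunction<I, O> {
        void reduce(Iterable<I> group, List<O> out);
    }

    // Adapter: all records in a group share one key, so the key is read from the
    // records while their values are buffered and handed to the unchanged reducer.
    static <K2, V2, K3, V3> GroupReduceStyleFunction<Map.Entry<K2, V2>, Map.Entry<K3, V3>>
            wrap(HadoopStyleReducer<K2, V2, K3, V3> reducer) {
        return (group, out) -> {
            List<V2> values = new ArrayList<>();
            K2 key = null;
            for (Map.Entry<K2, V2> record : group) {
                key = record.getKey();
                values.add(record.getValue());
            }
            if (!values.isEmpty()) {
                reducer.reduce(key, values.iterator(),
                               (k, v) -> out.add(new SimpleEntry<>(k, v)));
            }
        };
    }

    // Emits (word, 1) pairs, groups them by word, and applies the wrapped reducer per group.
    static List<Map.Entry<String, Integer>> countWords(String text) {
        HadoopStyleReducer<String, Integer, String, Integer> counter = (key, values, collector) -> {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            collector.collect(key, sum);
        };
        GroupReduceStyleFunction<Map.Entry<String, Integer>, Map.Entry<String, Integer>> fn =
            wrap(counter);

        // Stand-in for a group-by-key step; LinkedHashMap keeps first-seen key order.
        Map<String, List<Map.Entry<String, Integer>>> groups = new LinkedHashMap<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                groups.computeIfAbsent(word, w -> new ArrayList<>())
                      .add(new SimpleEntry<>(word, 1));
            }
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> group : groups.values()) {
            fn.reduce(group, out);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
    }
}
```

In a real Flink program the grouping would be done by Flink itself, and the wrapper classes listed above would take the place of the hand-written adapter.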
9. How to use Hadoop functions in Flink program
You can use Hadoop functions at any position within a Flink program and mix them with native Flink functions. This means you can implement an arbitrarily complex Flink program consisting of multiple Hadoop InputFormats, OutputFormats, Mappers and Reducers, without assembling a workflow of Hadoop jobs in an external driver program or using a workflow scheduler such as Apache Oozie.
10. Summary of Hadoop Flink compatibility
As you have seen, Flink lets you reuse the code that you wrote for Hadoop MapReduce, including all data types, all InputFormats and OutputFormats, and the Mappers and Reducers of the mapred API. You can also use Hadoop functions within Flink programs and mix them with all other Flink functions. Flink’s pipelined execution lets you assemble Hadoop functions arbitrarily without exchanging data via HDFS.