Apache Flume Interceptors | Types of Interceptors in Flume

Boost your career with Free Big Data Courses!!

In this Apache Flume Tutorial, we talk about Apache Flume interceptors. Interceptors in Flume are those who have the capability to modify/drop events in-flight. So, in this blog, we will learn the whole concept of Apache Flume interceptors.

Also, we will see several types of Interceptors in Flume: Host Flume Interceptors, Morphline Interceptor, Flume Interceptors Regex Extractor, Regex Filtering Interceptor, Remove Header Interceptor, Search and Replace Interceptor, Static Interceptor, Timestamp Interceptors, and UUID Interceptor to understand this topic well.

Moreover, we will cover interceptors in Apache Flume examples and search the answer to this question, how to add interceptor in Flume to learn it more clearly.

What is Flume Interceptors

Basically, we can modify/drop events in-flight with the help of Apache Flume. It has the capability. So, this process takes place with the help of interceptors in Flume.

Moreover, they are the classes that implement org.apache.flume.interceptor.Interceptor interface. Also, can modify or even drop events based on any criteria chosen by the developer of the interceptor.

In addition, Apache Flume supports chaining of interceptors. It is only possible through by specifying the list of interceptor builder class names in the configuration. Although, in the source configuration Flume interceptors are specified as a whitespace separated list.

However, the order in which they are invoked, is the order in which the interceptors are specified. They are named components. So let’s see an example of how we create Flume Interceptors through configuration:

For example,
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
a1.sinks.k1.filePrefix = FlumeData.%{CollectorHost}.%Y-%m-%d
a1.sinks.k1.channel = c1

It is very important to note that Flume interceptor builders are passed to the type config parameter.

Types of Interceptors in Flume

There are 9 types of Flume Interceptors, let’s discuss them one by one:

Flume Interceptors

Types of Flume Interceptors

i. Timestamp: Apache Flume Interceptors

While it comes to Timestamp Flume interceptor, it inserts into the event headers, the time in millis at which it processes the event. Moreover, we can say it inserts a header with the key timestamp or as specified by the header property, whose value is the relevant timestamp.

Also, make sure if it is already present in the configuration, this interceptor can preserve an existing timestamp. The below table shows the property name and description of Timestamp Flume Interceptors.
Table.1 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be timestamp or the FQCN
header timestampThe name of the header in which to place the generated timestamp.
preserveExistingfalseIf the timestamp already exists, should it be preserved – true or false

So, let’s see an example for agent a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels =  c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

ii. Host: Apache Flume Interceptors

Basically, it inserts the hostname or IP address of the host that this agent is running on. Moreover, with the key host or a configured key (whose value is the hostname or IP address of the host) on the basis of configuration, it inserts a header.

The Below table shows the property name and description of the property of Host Flume Interceptors.
Table.2 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be host
preserveExistingfalseIf the host header already exists, should it be preserved – true or false
useIPtrueUse the IP Address if true, else use hostname.
hostHeaderhostThe header key to be used.

Also, see an example for agent a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host

iii. Static: Apache Flume Interceptors

While it comes to append a static header with static value to all events, it is possible with the Static Flume interceptors. 

Below table shows the property name and description of the property of Static Flume Interceptors.
Table.3 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be static
preserveExistingtrueIf configured header already exists, should it be preserved – true or false
keykeyName of the header that should be created
valuevalueStatic value that should be created

So, let’s see an example for agent a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels =  c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK

Note: It does not allow specifying multiple headers at one time in its current implementation. Despite that, as a user, we can chain multiple static interceptors each defining one static header.

iv. Remove Header: Apache Flume Interceptors

However, by removing one or many headers, this interceptor manipulates Flume event headers. By these Flume interceptors, We can remove a statically defined header. Either header based on a regular expression or headers in a list.

Although, make sure the Flume events are not modified, if none of these is defined, or if no header matches the criteria. Below table shows the property name and description of the property of Remove Header Flume Interceptors.

Table.4 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be remove_header
withNameName of the header to remove
fromListList of headers to remove, separated with the separator specified by fromListSeparator
fromListSeparator\s*,\s*Regular expression used to separate multiple header names in the list specified by fromList. Default is a comma surrounded by any number of whitespace characters
matchingAll the headers which names match this regular expression are removed

Note:  Since, we need to remove only one header, specifying it by name provides performance benefits over the other 2 methods.

v. UUID: Apache Flume Interceptors

With the use of UUID Flume interceptors, we generally set a universally unique identifier on all events those are intercepted. Let’s see an example,  UUID is b5755073-77a9-43c1-8fad-b7a586fc1b97, that represents a 128-bit value.

In addition, to automatically assign a UUID to an event consider using UUIDInterceptor. Since no application level, the unique key for the event is available. As soon as they enter the Flume network; that is, in the first Flume Source of the flow, it is important to assign UUIDs to events.

In the face of replication and redelivery in a Flume network, this enables subsequent deduplication of events that are designed for high availability and high performance. Moreover, this is preferable over an auto-generated UUID if an application level key is available.

Since it enables subsequent updates and deletes of the event in data stores using said well-known application level key. Below table shows the property name and description of the property of UUID Flume Interceptors.

Table.5 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
headerNameidThe name of the Flume header to modify
preserveExistingtrueIf the UUID header already exists, should it be preserved – true or false
prefixThe prefix string constant to prepend to each generated UUID

vi. Morphline: Apache Flume Interceptors

While it comes to filters the events through a morphline configuration file we use Morphline Interceptor. Basically, that defines a chain of transformation commands that pipe records from one command to another.

Let, understand this with an example, via regular expression based pattern matching the morphline can ignore certain events or alter or insert certain event headers. Also,  via Apache Tika on events that are intercepted, it can auto-detect and set a MIME type.

For example, in a Flume topology, we can use this kind of packet sniffing for content-based dynamic routing. Moreover, we can also use MorphlineInterceptor to implement dynamic routing for multiple Apache Solr collections. Such as multi-tenancy. 

Below table shows the property name and description of the property of Morphline Flume Interceptors.
Table.6 – Apache Flume Interceptor

Property Name    DefaultDescription
typeThe component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
morphlineFileThe relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf
morphlineIdnullOptional name used to identify a morphline if there are multiple morphlines in a morphline config file

Sample flume.conf file:

a1.sources.avroSrc.interceptors = morphlineinterceptor
a1.sources.avroSrc.interceptors.morphlineinterceptor.type= org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile= /etc/flume-ng/conf/morphline.conf
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1

Note: However, the morphline of an interceptor must not generate more than one output record for each input event, currently there is a restriction in that.

Moreover, we can not use these Flume interceptors for heavy duty ETL processing. Though, if you need this consider moving ETL processing from the Flume Source to a Flume Sink. Such as to a MorphlineSolrSink.

vii. Search and Replace: Apache Flume Interceptors

However, on the basis of Java regular expressions, it offers simple string-based search-and-replace functionality. Also, Backtracking/group capture is available.

Moreover, there are same rules used by this interceptor as in the Java Matcher.replaceAll() method. The below table shows Property name and description of Search and replace Flume interceptors.
Table.7 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be search_replace
searchPatternThe pattern to search for and replace.
replaceStringThe replacement string.
charset UTF-8The charset of the event body. Assumed by default to be UTF-8.

So,let’s read an example configuration:

a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Remove leading alphanumeric characters in an event body.
a1.sources.avroSrc.interceptors.search-replace.searchPattern = ^[A-Za-z0-9_]+
a1.sources.avroSrc.interceptors.search-replace.replaceString =

Another example:

a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Use grouping operators to reorder and munge words on a line.
a1.sources.avroSrc.interceptors.search-replace.searchPattern = The quick brown ([a-z]+) jumped over the lazy ([a-z]+)
a1.sources.avroSrc.interceptors.search-replace.replaceString = The hungry $2 ate the careless $1

viii. Regex Filtering: Apache Flume Interceptors

While it comes to filter events selectively we use Regex Filtering Interceptor. Basically, it is possible by interpreting the event body as text and matching the text against a configured regular expression.

Moreover, to include events or exclude events we can use the supplied regular expression. The following table shows property name along with descriptions of Regex Filtering Flume Interceptors.
Table.8 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be regex_filter
regex”.*”Regular expression for matching against events
excludeEventsfalseIf true, regex determines events to exclude, otherwise, regex determines events to include.

ix. Flume Regex Extractor: Apache Flume Interceptors

While it comes to extracts regex match groups we use Regex Extractor. Basically, it is only possible by using a specified regular expression and appends the match groups as headers on the event.

Also, for formatting the match groups before adding them as event headers it supports pluggable serializers. A table shows Regex Extractor Flume Interceptors.
Table.8 – Apache Flume Interceptor

Property NameDefaultDescription
typeThe component type name has to be regex_extractor
regexRegular expression for matching against events
serializersSpace-separated list of serializers for mapping matches to header names and serializing their values. (See example below) Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
serializers.<s1>.typedefaultMust be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer
serializers.<s1>.name
serializers.*Serializer-specific properties

Apache Flume Interceptors – Conclusion

Hence, in this Apache Flume tutorial, we have studied the whole concept of Flume Interceptors. Also, we have seen several types of Flume Interceptors, Timestamp Interceptor, Host Interceptor, Static Interceptor, Remove Header Interceptor, UUID Interceptor, Morphline Interceptor, Search and Replace Interceptor, Regex Filtering Interceptor, Regex Extractor Interceptor.

Moreover, we have seen Apache Flume Interceptors examples to completely understand this topic. Futhermore, if you have any doubt, please ask through the comment section.

We work very hard to provide you quality material
Could you take 15 seconds and share your happy experience on Google

follow dataflair on YouTube

Leave a Reply

Your email address will not be published. Required fields are marked *