Apache Flume Interceptors | Types of Interceptors in Flume
In this Apache Flume tutorial, we discuss Apache Flume interceptors, the components that can modify or drop events in-flight.
We will look at the types of interceptors in Flume: Timestamp Interceptor, Host Interceptor, Static Interceptor, Remove Header Interceptor, UUID Interceptor, Morphline Interceptor, Search and Replace Interceptor, Regex Filtering Interceptor, and Regex Extractor Interceptor.
We will also walk through example configurations to see how to add an interceptor in Flume.
What are Flume Interceptors
Apache Flume can modify or drop events in-flight. This is done with the help of interceptors.
Interceptors are classes that implement the org.apache.flume.interceptor.Interceptor interface. An interceptor can modify or even drop events based on any criteria chosen by the developer of the interceptor.
In addition, Apache Flume supports chaining of interceptors. This is done by specifying the list of interceptor builder class names in the configuration. In the source configuration, interceptors are specified as a whitespace-separated list.
The interceptors are invoked in the order in which they are specified. Interceptors are named components. Here is an example of how they are created through configuration:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.HostInterceptor$Builder
a1.sources.r1.interceptors.i1.preserveExisting = false
a1.sources.r1.interceptors.i1.hostHeader = hostname
a1.sources.r1.interceptors.i2.type = org.apache.flume.interceptor.TimestampInterceptor$Builder
# %{hostname} refers to the header set by interceptor i1 (hostHeader = hostname)
a1.sinks.k1.filePrefix = FlumeData.%{hostname}.%Y-%m-%d
a1.sinks.k1.channel = c1
Note that the interceptor builder class, not the interceptor class itself, is passed to the type config parameter.
Types of Interceptors in Flume
There are 9 types of Flume interceptors. Let’s discuss them one by one:
i. Timestamp: Apache Flume Interceptors
The Timestamp interceptor inserts into the event headers the time in millis at which it processes the event. It inserts a header with the key timestamp (or the key specified by the header property) whose value is the relevant timestamp.
Depending on the configuration, this interceptor can preserve an existing timestamp if one is already present. The table below shows the property names and descriptions for the Timestamp interceptor.
Table.1 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be timestamp or the FQCN |
header | timestamp | The name of the header in which to place the generated timestamp. |
preserveExisting | false | If the timestamp already exists, should it be preserved – true or false |
So, let’s see an example for agent a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
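As a small variation on the example above, the following sketch turns on preserveExisting from Table.1. It assumes that events may already carry a timestamp header when they reach source r1 (for instance, set by an upstream agent), which is a scenario chosen only for illustration.

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# Keep a timestamp header that is already present on the event
# instead of overwriting it with the processing time of this agent.
a1.sources.r1.interceptors.i1.preserveExisting = true
```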
ii. Host: Apache Flume Interceptors
The Host interceptor inserts the hostname or IP address of the host that this agent is running on. Depending on the configuration, it inserts a header with the key host, or a configured key, whose value is the hostname or IP address of the host.
The table below shows the property names and descriptions for the Host interceptor.
Table.2 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be host |
preserveExisting | false | If the host header already exists, should it be preserved – true or false |
useIP | true | Use the IP Address if true, else use hostname. |
hostHeader | host | The header key to be used. |
Also, see an example for agent a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
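The example above relies on the defaults from Table.2. As a sketch of the other properties in that table, the following configuration records the hostname rather than the IP address and stores it under a different header key. The key name hostname is only an illustrative choice.

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# Record the hostname instead of the IP address (useIP defaults to true).
a1.sources.r1.interceptors.i1.useIP = false
# Store the value under "hostname" instead of the default key "host".
a1.sources.r1.interceptors.i1.hostHeader = hostname
```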
iii. Static: Apache Flume Interceptors
The Static interceptor appends a static header with a static value to all events.
The table below shows the property names and descriptions for the Static interceptor.
Table.3 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be static |
preserveExisting | true | If configured header already exists, should it be preserved – true or false |
key | key | Name of the header that should be created |
value | value | Static value that should be created |
So, let’s see an example for agent a1:
a1.sources = r1
a1.channels = c1
a1.sources.r1.channels = c1
a1.sources.r1.type = seq
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
Note: The current implementation does not allow specifying multiple headers at one time. Instead, we can chain multiple static interceptors, each defining one static header.
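The chaining described in the note can be sketched as follows. The second header (environment = production) is a hypothetical value added only for illustration; the first repeats the datacenter header from the example above.

```properties
# Two static interceptors chained on the same source, one header each.
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = datacenter
a1.sources.r1.interceptors.i1.value = NEW_YORK
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = environment
a1.sources.r1.interceptors.i2.value = production
```

Because interceptors run in the order listed, i1 adds its header before i2 does.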
iv. Remove Header: Apache Flume Interceptors
The Remove Header interceptor manipulates Flume event headers by removing one or many headers. It can remove a statically defined header, headers based on a regular expression, or headers in a list.
If none of these is defined, or if no header matches the criteria, the Flume events are not modified. The table below shows the property names and descriptions for the Remove Header interceptor.
Table.4 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be remove_header |
withName | - | Name of the header to remove |
fromList | - | List of headers to remove, separated with the separator specified by fromListSeparator |
fromListSeparator | \s*,\s* | Regular expression used to separate multiple header names in the list specified by fromList. Default is a comma surrounded by any number of whitespace characters |
matching | - | All the headers whose names match this regular expression are removed |
Note: If only one header needs to be removed, specifying it by name provides performance benefits over the other 2 methods.
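A minimal sketch of the properties in Table.4 might look like this. The header names (datacenter, and the tmp_ prefix) are hypothetical, and the two approaches are shown as two chained interceptors purely to keep them easy to tell apart.

```properties
a1.sources.r1.interceptors = i1 i2
# i1 removes a single, statically named header (the fastest option).
a1.sources.r1.interceptors.i1.type = remove_header
a1.sources.r1.interceptors.i1.withName = datacenter
# i2 removes every header whose name matches a regular expression.
a1.sources.r1.interceptors.i2.type = remove_header
a1.sources.r1.interceptors.i2.matching = ^tmp_.*
```

To remove a fixed set of headers instead, fromList could be used with a comma-separated list of names, which is what the default fromListSeparator expects.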
v. UUID: Apache Flume Interceptors
The UUID interceptor sets a universally unique identifier on all events that are intercepted. An example UUID is b5755073-77a9-43c1-8fad-b7a586fc1b97, which represents a 128-bit value.
Consider using UUIDInterceptor to automatically assign a UUID to an event if no application-level unique key for the event is available. It can be important to assign UUIDs to events as soon as they enter the Flume network; that is, in the first Flume Source of the flow.
This enables subsequent deduplication of events in the face of replication and redelivery in a Flume network that is designed for high availability and high performance.
However, if an application-level key is available, it is preferable to an auto-generated UUID, because it enables subsequent updates and deletes of the event in data stores using that well-known application-level key. The table below shows the property names and descriptions for the UUID interceptor.
Table.5 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder |
headerName | id | The name of the Flume header to modify |
preserveExisting | true | If the UUID header already exists, should it be preserved – true or false |
prefix | "" | The prefix string constant to prepend to each generated UUID |
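As a sketch of how the properties in Table.5 fit together, the following configuration attaches the UUID interceptor to the first source of a flow. The header name eventId and the prefix a1- are hypothetical values picked for illustration.

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.sink.solr.morphline.UUIDInterceptor$Builder
# Write the generated UUID to the "eventId" header instead of the default "id".
a1.sources.r1.interceptors.i1.headerName = eventId
# Leave the header untouched if the event already carries one.
a1.sources.r1.interceptors.i1.preserveExisting = true
# Prepend a constant to each generated UUID, e.g. to identify the assigning agent.
a1.sources.r1.interceptors.i1.prefix = a1-
```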
vi. Morphline: Apache Flume Interceptors
The Morphline interceptor filters events through a morphline configuration file that defines a chain of transformation commands that pipe records from one command to another.
For example, the morphline can ignore certain events, or alter or insert certain event headers, via regular-expression-based pattern matching. It can also auto-detect and set a MIME type via Apache Tika on events that are intercepted.
This kind of packet sniffing can be used for content-based dynamic routing in a Flume topology. MorphlineInterceptor can also be used to implement dynamic routing to multiple Apache Solr collections, for example for multi-tenancy.
The table below shows the property names and descriptions for the Morphline interceptor.
Table.6 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder |
morphlineFile | - | The relative or absolute path on the local file system to the morphline configuration file. Example: /etc/flume-ng/conf/morphline.conf |
morphlineId | null | Optional name used to identify a morphline if there are multiple morphlines in a morphline config file |
Sample flume.conf file:
a1.sources.avroSrc.interceptors = morphlineinterceptor
a1.sources.avroSrc.interceptors.morphlineinterceptor.type = org.apache.flume.sink.solr.morphline.MorphlineInterceptor$Builder
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineFile = /etc/flume-ng/conf/morphline.conf
a1.sources.avroSrc.interceptors.morphlineinterceptor.morphlineId = morphline1
Note: Currently, there is a restriction in that the morphline of an interceptor must not generate more than one output record for each input event.
Moreover, this interceptor is not intended for heavy-duty ETL processing. If you need this, consider moving ETL processing from the Flume Source to a Flume Sink, such as a MorphlineSolrSink.
vii. Search and Replace: Apache Flume Interceptors
The Search and Replace interceptor provides simple string-based search-and-replace functionality based on Java regular expressions. Backtracking/group capture is also available.
This interceptor uses the same rules as the Java Matcher.replaceAll() method. The table below shows the property names and descriptions for the Search and Replace interceptor.
Table.7 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be search_replace |
searchPattern | - | The pattern to search for and replace. |
replaceString | - | The replacement string. |
charset | UTF-8 | The charset of the event body. Assumed by default to be UTF-8. |
So, let’s look at an example configuration:
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Remove leading alphanumeric characters in an event body.
a1.sources.avroSrc.interceptors.search-replace.searchPattern = ^[A-Za-z0-9_]+
a1.sources.avroSrc.interceptors.search-replace.replaceString =
Another example:
a1.sources.avroSrc.interceptors = search-replace
a1.sources.avroSrc.interceptors.search-replace.type = search_replace
# Use grouping operators to reorder and munge words on a line.
a1.sources.avroSrc.interceptors.search-replace.searchPattern = The quick brown ([a-z]+) jumped over the lazy ([a-z]+)
a1.sources.avroSrc.interceptors.search-replace.replaceString = The hungry $2 ate the careless $1
viii. Regex Filtering: Apache Flume Interceptors
The Regex Filtering interceptor filters events selectively by interpreting the event body as text and matching the text against a configured regular expression.
The supplied regular expression can be used either to include events or to exclude events. The table below shows the property names and descriptions for the Regex Filtering interceptor.
Table.8 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be regex_filter |
regex | ".*" | Regular expression for matching against events |
excludeEvents | false | If true, regex determines events to exclude, otherwise, regex determines events to include. |
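The properties in this table can be combined as in the following sketch, which drops events whose body starts with DEBUG. The pattern is a hypothetical example and assumes single-line text event bodies.

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
# Events whose body matches this pattern are selected by the filter.
a1.sources.r1.interceptors.i1.regex = ^DEBUG.*
# true: matching events are dropped; false (default): only matching events are kept.
a1.sources.r1.interceptors.i1.excludeEvents = true
```

Leaving excludeEvents at its default of false would invert the behavior and keep only the DEBUG events.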
ix. Flume Regex Extractor: Apache Flume Interceptors
The Regex Extractor interceptor extracts regex match groups using a specified regular expression and appends the match groups as headers on the event.
It also supports pluggable serializers for formatting the match groups before adding them as event headers. The table below shows the property names and descriptions for the Regex Extractor interceptor.
Table.9 – Apache Flume Interceptor
Property Name | Default | Description |
type | - | The component type name has to be regex_extractor |
regex | - | Regular expression for matching against events |
serializers | - | Space-separated list of serializers for mapping matches to header names and serializing their values. Flume provides built-in support for the following serializers: org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer and org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer |
serializers.<s1>.type | default | Must be default (org.apache.flume.interceptor.RegexExtractorInterceptorPassThroughSerializer), org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer, or the FQCN of a custom class that implements org.apache.flume.interceptor.RegexExtractorInterceptorSerializer |
serializers.<s1>.name | - | Name of the event header that the corresponding match group is written to |
serializers.* | - | Serializer-specific properties |
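To make the serializer properties more concrete, here is a sketch that extracts three digits separated by colons and maps each match group to its own header through the default pass-through serializer. The serializer names s1 to s3 and the header names one, two and three are arbitrary labels chosen for illustration.

```properties
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# Three capture groups; backslashes are doubled because this is a properties file.
a1.sources.r1.interceptors.i1.regex = (\\d):(\\d):(\\d)
# One serializer per capture group, in order.
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
a1.sources.r1.interceptors.i1.serializers.s1.name = one
a1.sources.r1.interceptors.i1.serializers.s2.name = two
a1.sources.r1.interceptors.i1.serializers.s3.name = three
```

With this configuration, an event body such as 1:2:3.4foobar5 would end up with the headers one=1, two=2 and three=3, while the body itself is left unchanged.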
Apache Flume Interceptors – Conclusion
In this Apache Flume tutorial, we have studied the concept of Flume interceptors. We have also seen the types of Flume interceptors: Timestamp Interceptor, Host Interceptor, Static Interceptor, Remove Header Interceptor, UUID Interceptor, Morphline Interceptor, Search and Replace Interceptor, Regex Filtering Interceptor, and Regex Extractor Interceptor.
Moreover, we have seen example configurations for Apache Flume interceptors. If you have any doubts, please ask in the comment section.