Impala DISTINCT Operator – Syntax & Example

Boost your career with Free Big Data Courses!!

In our previous tutorial, we studied the Impala Offset Clause. Now, we are going to study Impala DISTINCT Operator. Although, there is much more to learn about using Impala DISTINCT Operator.
So, let’s start Impala DISTINCT Operator.

What is Impala DISTINCT Operator?

While we want to filter the results or  to remove duplicates, we use The DISTINCT operator in a SELECT statement:

— Returns the unique values from one column.
— NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from Employees;
— Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from Employees;

Moreover, to find how many different values a column contains, we can use DISTINCT in combination with an aggregation function.Typically COUNT():

— Counts the unique values from one column.
— NULL is not included as a distinct value in the count.

select count(distinct c_birth_country) from Employees;

— Counts the unique combinations of values from multiple columns.

select count(distinct c_salutation, c_last_name) from Employees;

However, make sure that using DISTINCT in more than one aggregation function in the same query is not supported by Impala SQL. To understand more, we could not have a single query with both COUNT(DISTINCT c_first_name) and COUNT(DISTINCT c_last_name) in the SELECT list.

Syntax for Distinct Operator in Impala

So, the syntax for using Impala DISTINCT Operator is-

select distinct columns… from table_name;

Zero-length strings

Impala considers zero-length strings (“”), NULL for purposes of clauses such as DISTINCT and GROUP BY. Also, space to all is different values.

As we discussed earlier, Impala only allows a single COUNT(DISTINCT columns) expression in each query.

Moreover, we can produce an estimate of the distinct values for a column by specifying NDV(column), if we do not need precise accuracy. Here, a query can contain multiple instances of NDV(column).

Also, can enable the APPX_COUNT_DISTINCT query option, to make Impala automatically rewrite COUNT(DISTINCT) expressions to NDV().

Further, we can use the following technique for queries involving a single table in order to produce the same result as multiple COUNT(DISTINCT) expressions:

select v1.c1 result1, v2.c1 result2 from
 (select count(distinct col1) as c1 from t1) v1
   cross join
 (select count(distinct col2) as c1 from t1) v2;

By the way, prefer to use the NDV() technique while practical because CROSS JOIN is an expensive operation.

Make sure, Impala does not do any ordering of DISTINCT values in contrast with some database systems. Basically, that always return DISTINCT values in sorted order. So, to avoid that include an ORDER BY clause always, if we need the values in alphabetical or numeric sorted order.

Impala Distinct Operator – Example

Let’s discuss Impala DISTINCT Operator Example,

So, let us suppose we have a table named Employees in the database my_db. Its contents are −

[quickstart.cloudera:21000] > select distinct name, age, address from Employees;
Query: select distinct  name, age, address from Employees

However, we can see that the name of the Employees Shubham and Revti entered twice. So,  now we can select the unique values using the Impala Distinct Operator.

[quickstart.cloudera:21000] > select distinct name, age, address from Employees;
On executing, the above query gives the following output.
Query: select distinct  name from Employees

Fetched 9 row(s) in 1.46s
So, this was all about Impala DISTINCT Operator. Hope you like our explanation.


Hence, in this article, we have seen how to use Impala DISTINCT Operator properly. However, if you want to ask any doubt, feel free to ask in the comment section. we will definitely respond.

Did you know we work 24x7 to provide you best tutorials
Please encourage us - write a review on Google

follow dataflair on YouTube

Leave a Reply

Your email address will not be published. Required fields are marked *