Impala DISTINCT Operator – Syntax & Example

1. Objective

In our previous tutorial, we studied the Impala Offset Clause. Now, we are going to study Impala DISTINCT Operator. Although, there is much more to learn about using Impala DISTINCT Operator.
So, let’s start Impala DISTINCT Operator.

Distinct Operator in Impala

If these professionals can make a switch to Big Data, so can you:
Rahul Doddamani Story - DataFlair
Rahul Doddamani
Java → Big Data Consultant, JDA
Follow on
Mritunjay Singh Success Story - DataFlair
Mritunjay Singh
PeopleSoft → Big Data Architect, Hexaware
Follow on
Rahul Doddamani Success Story - DataFlair
Rahul Doddamani
Big Data Consultant, JDA
Follow on
I got placed, scored 100% hike, and transformed my career with DataFlair
Enroll now
Deepika Khadri Success Story - DataFlair
Deepika Khadri
SQL → Big Data Engineer, IBM
Follow on
DataFlair Web Services
You could be next!
Enroll now

2. What is Impala DISTINCT Operator?

While we want to filter the results or  to remove duplicates, we use The DISTINCT operator in a SELECT statement:

— Returns the unique values from one column.
— NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from Employees;
— Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from Employees;
Moreover, to find how many different values a column contains, we can use DISTINCT in combination with an aggregation function.Typically COUNT():

— Counts the unique values from one column.
— NULL is not included as a distinct value in the count.

select count(distinct c_birth_country) from Employees;

— Counts the unique combinations of values from multiple columns.

select count(distinct c_salutation, c_last_name) from Employees;

However, make sure that using DISTINCT in more than one aggregation function in the same query is not supported by Impala SQL. To understand more, we could not have a single query with both COUNT(DISTINCT c_first_name) and COUNT(DISTINCT c_last_name) in the SELECT list.

3. Syntax for Distinct Operator in Impala

So, the syntax for using Impala DISTINCT Operator is-

select distinct columns… from table_name;

4. Zero-length strings

Impala considers zero-length strings (“”), NULL for purposes of clauses such as DISTINCT and GROUP BY. Also, space to all is different values.

Note:
As we discussed earlier, Impala only allows a single COUNT(DISTINCT columns) expression in each query.

Moreover, we can produce an estimate of the distinct values for a column by specifying NDV(column), if we do not need precise accuracy. Here, a query can contain multiple instances of NDV(column). Also, can enable the APPX_COUNT_DISTINCT query option, to make Impala automatically rewrite COUNT(DISTINCT) expressions to NDV().

Further, we can use the following technique for queries involving a single table in order to produce the same result as multiple COUNT(DISTINCT) expressions:

select v1.c1 result1, v2.c1 result2 from
 (select count(distinct col1) as c1 from t1) v1
   cross join
 (select count(distinct col2) as c1 from t1) v2;

By the way, prefer to use the NDV() technique while practical because CROSS JOIN is an expensive operation.

Note:
Make sure, Impala does not do any ordering of DISTINCT values in contrast with some database systems. Basically, that always return DISTINCT values in sorted order. So, to avoid that include an ORDER BY clause always, if we need the values in alphabetical or numeric sorted order.

Hadoop Quiz

5. Impala Distinct Operator – Example

Let’s discuss Impala DISTINCT Operator Example,

So, let us suppose we have a table named Employees in the database my_db. Its contents are −

[quickstart.cloudera:21000] > select distinct name, age, address from Employees;
Query: select distinct  name, age, address from Employees
NameAgeAddress
Shubham32Delhi
Shubham32Delhi
Monika25Mumbai
Kajal27Alirajpur
Revti25Indore
Revti25Indore
Shreyash23Pune
Mehul22Hyderabad

However, we can see that the name of the Employees Shubham and Revti entered twice. So,  now we can select the unique values using the Impala Distinct Operator.

[quickstart.cloudera:21000] > select distinct name, age, address from Employees;
On executing, the above query gives the following output.
Query: select distinct  name from Employees
NameAgeAddress
Shubham32Delhi
Monika25Mumbai
Kajal27Alirajpur
Revti25Indore
Shreyash23Pune
Mehul22Hyderabad

Fetched 9 row(s) in 1.46s
So, this was all about Impala DISTINCT Operator. Hope you like our explanation.

6. Conclusion

Hence, in this article, we have seen how to use Impala DISTINCT Operator properly. However, if you want to ask any doubt, feel free to ask in the comment section.we will definitely respond.
Related Topic- Top 50 Impala Interview Questions and Answers
For reference

Leave a Reply

Your email address will not be published. Required fields are marked *

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.