What is Impala Troubleshooting & Performance Tuning
Diagnosing and debugging problems in Impala is what we call Impala troubleshooting and performance tuning. So, in this article, "Impala Troubleshooting & Performance Tuning", we will study several ways of diagnosing and debugging problems in Impala.
So, let's start with Impala troubleshooting and known issues.
Impala Troubleshooting & Performance Tuning
Basically, being able to diagnose and debug problems in Impala is what we call Impala troubleshooting and performance tuning.
These problems include performance, network connectivity, out-of-memory conditions, disk space usage, and crashes or hangs in any of the Impala-related daemons. There are several approaches we can follow for diagnosing and debugging the above-mentioned problems.
So, let's discuss the Impala troubleshooting and performance tuning methods one by one:
i. Impala Performance tuning
Here are some factors that affect the performance of Impala features, along with procedures for tuning, monitoring, and benchmarking Impala queries and other SQL operations.
We will also explain techniques for maximizing Impala scalability. Scalability is tied to performance: it means that performance remains high as the system workload increases. Let's understand this with an example.
Reducing the disk I/O performed by a query can speed up an individual query, and at the same time improve scalability by making it practical to run more queries simultaneously.
Sometimes, an optimization technique improves scalability more than performance.
For example, reducing memory usage for a query might not change the query performance much, but might improve scalability by allowing more Impala queries or other kinds of jobs to run at the same time without running out of memory.
Note: Make sure your system meets all the recommended minimum hardware requirements before starting any performance tuning or benchmarking for Impala.
- Partitioning of Impala Tables
This technique divides the data based on the different values in frequently queried columns, allowing queries to skip reading a large percentage of the data in a table.
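For instance, a partitioned table might look like the following. This is a minimal sketch; the table and column names (sales, year, month) are hypothetical.
CREATE TABLE sales (id BIGINT, amount DECIMAL(10,2))
  PARTITIONED BY (year INT, month INT)
  STORED AS PARQUET;
-- Only the partitions for June 2023 are scanned; all other partitions are skipped.
SELECT SUM(amount) FROM sales WHERE year = 2023 AND month = 6;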
- Performance Considerations for Join Queries
Joins are the main class of queries that we can tune at the SQL level, as opposed to changing physical factors such as the file format or the hardware configuration.
The related topics Overview of Table Statistics and Overview of Column Statistics are also important, primarily for join performance.
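As a hedged illustration (the table names big_tbl and small_tbl are hypothetical), the STRAIGHT_JOIN keyword tells Impala to join the tables in the order they are listed, which can help when the automatically chosen join order performs poorly:
-- Join big_tbl and small_tbl exactly in the order written, rather than letting
-- the planner reorder them.
SELECT STRAIGHT_JOIN b.id, s.name
FROM big_tbl b
JOIN small_tbl s ON b.id = s.id;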
- Overview of Table Statistics and Overview of Column Statistics
Collecting table and column statistics with the COMPUTE STATS statement helps Impala automatically optimize the performance of join queries, without requiring changes to SQL query statements.
Note that in Impala 1.2.2 and higher this process is greatly simplified, since the COMPUTE STATS statement collects both kinds of statistics in one operation. It also does not require any setup and configuration, as was previously necessary for the ANALYZE TABLE statement in Hive.
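For example, for hypothetical tables t1 and t2 involved in a join, the statements might look like this:
COMPUTE STATS t1;
COMPUTE STATS t2;
-- Verify that statistics were collected:
SHOW TABLE STATS t1;
SHOW COLUMN STATS t1;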
- Testing Impala Performance
Before conducting any benchmark tests, do some post-setup testing, in order to ensure Impala is using optimal settings for performance.
- Benchmarking Impala Queries
Basically, the sample data and the configuration we use for initial experiments with Impala are often not appropriate for doing performance tests.
- Controlling Impala Resource Usage
The more memory Impala can utilize, the better query performance we can expect. However, in a cluster running other kinds of workloads as well, we must make tradeoffs to ensure all Hadoop components have enough memory to perform well, so we might cap the memory that Impala can use.
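One way to cap memory at the session level is the MEM_LIMIT query option in impala-shell. This is only a sketch; the 2g value and the sales table are illustrative, and cluster-wide limits are normally set through the impalad startup options instead.
SET MEM_LIMIT=2g;
-- Subsequent queries in this session are limited to roughly 2 GB of memory per node.
SELECT COUNT(*) FROM sales;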
- Using Impala with the Amazon S3 Filesystem
Queries against data stored in the Amazon Simple Storage Service (S3) have different performance characteristics than queries against data stored in the Hadoop Distributed File System (HDFS).
ii. Impala Troubleshooting Quick Reference
Here is a list of common Impala problems and their potential solutions:
- Impala takes a long time to start
Explanation: Impala instances with large numbers of tables, partitions, or data files take longer to start, because the metadata for these objects is broadcast to all impalad nodes and cached.
Recommendation: Try to adjust timeout and synchronicity settings.
- Joins fail to complete
Explanation: This occurs due to insufficient memory. During a join, data from the second, third, and subsequent sets to be joined is loaded into memory. If Impala chooses an inefficient join order or join mechanism, the query could exceed the total memory available.
Recommendation: Start by gathering statistics with the COMPUTE STATS statement for each table involved in the join.
Consider specifying the [SHUFFLE] hint so that data from the joined tables are split up between nodes rather than broadcast to each node. If tuning at the SQL level is not sufficient, add more memory to your system or join smaller data sets.
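A hedged sketch of the [SHUFFLE] hint, with hypothetical tables t1 and t2:
-- Force a partitioned (shuffle) join instead of broadcasting t2 to every node:
SELECT t1.id, t2.val
FROM t1 JOIN [SHUFFLE] t2 ON t1.id = t2.id;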
- Queries return incorrect results
Explanation: After changes are performed in Hive, Impala metadata may be outdated.
Recommendation: Where possible, use the appropriate Impala statement (INSERT, LOAD DATA, CREATE TABLE, ALTER TABLE, COMPUTE STATS, and so on) rather than switching back and forth between Impala and Hive.
Impala automatically broadcasts the results of DDL and DML operations to all Impala nodes in the cluster, but it does not automatically recognize such changes when they are made through Hive.
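When changes are made through Hive, metadata can be brought up to date manually from impala-shell. This is a minimal sketch; the table name sales is hypothetical.
-- After new data files are added to an existing table through Hive:
REFRESH sales;
-- After tables are created, dropped, or altered through Hive:
INVALIDATE METADATA sales;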
- Queries are slow to return results
Explanation: It is possible that Impala cannot use native checksumming. Native checksumming uses machine-specific instructions to compute checksums over HDFS data very quickly.
Review the Impala logs; if you find instances of "INFO util.NativeCodeLoader" messages, native checksumming is not enabled.
Recommendation: Just make sure Impala is configured to use native checksumming.
- Queries are slow to return results
Explanation: Impala may not be configured to use data locality tracking.
Recommendation: Test Impala for data locality tracking and make configuration changes as necessary.
iii. Troubleshooting Impala SQL Syntax Issues
There are times when queries issued against Impala fail. At such times, we can try running the same queries against Hive.
If a query fails against both Impala and Hive, there is likely a problem with the query or with other elements of the environment. In that case, check the following:
- To ensure your query is valid, review the Language Reference.
- Also, check the Impala Reserved Words list, in case any database, table, column, or other object names in your query conflict with Impala reserved words. Quote those names with backticks (`), as shown in the example after this list.
- Moreover, check the Impala Built-In Functions list to confirm whether Impala supports all the built-in functions being used by your query, and whether the argument and return types are what you expect.
- Review the contents of the Impala logs for any information that may help identify the source of the problem.
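For instance, if a hypothetical table had column names that clash with reserved words, they could be quoted like this (the table and column names are illustrative only):
-- Backticks let identifiers that conflict with reserved words be used safely.
SELECT `date`, `location`, amount
FROM store_sales
WHERE `date` = '2023-06-01';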
In some scenarios, if a query fails against Impala but not against Hive, there is a problem with your Impala installation.
iv. Impala Web User Interface for Debugging
Each of the Impala daemons includes a built-in web server that displays diagnostic and status information about the daemon itself:
- The Impalad web UI (default port: 25000)
This web UI displays information such as configuration settings, running and completed queries, and associated performance and resource usage. The Details link for each query displays alternative views of the query, including a graphical representation of the plan and output like that of the EXPLAIN, SUMMARY, and PROFILE statements from impala-shell.
This web UI is mainly for diagnosing query problems that can be traced to a particular node.
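Similar views can also be produced directly from impala-shell; this is a hedged example, and the sales table is hypothetical:
EXPLAIN SELECT COUNT(*) FROM sales WHERE year = 2023;
SELECT COUNT(*) FROM sales WHERE year = 2023;
SUMMARY;   -- condensed per-operator timings for the most recent query
PROFILE;   -- full runtime profile for the most recent query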
- The Statestored web UI (default port: 25010)
This web UI provides information such as memory usage, configuration settings, and the ongoing health checks performed by this daemon. Since there is only a single instance of this daemon within any cluster, we view this web UI only on the particular host that serves as the Impala StateStore.
- The Catalogd web UI (default port: 25020)
The Catalogd web UI shows information about the databases, tables, and other objects managed by Impala, in addition to the daemon's own resource usage and configuration settings. This information is displayed as the underlying Thrift data structures.
Since there is only a single instance of this daemon within any cluster, we view this web UI only on the particular host that serves as the Impala Catalog Server.
v. Troubleshooting I/O Capacity Problems
Impala queries are typically I/O-intensive. If there is an I/O problem with storage devices, or with HDFS itself, Impala queries could show slow response times with no obvious cause on the Impala side.
Slow I/O on even a single DataNode could result in an overall slowdown, because queries involving clauses such as ORDER BY, GROUP BY, or JOIN do not start returning results until all DataNodes have finished their work.
To test whether the Linux I/O system itself is performing as expected, run Linux commands like the following on each DataNode:
$ sudo sysctl -w vm.drop_caches=3 vm.drop_caches=0
vm.drop_caches = 3
vm.drop_caches = 0
$ sudo dd if=/dev/sda bs=1M of=/dev/null count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.60373 s, 192 MB/s
$ sudo dd if=/dev/sdb bs=1M of=/dev/null count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.51145 s, 195 MB/s
$ sudo dd if=/dev/sdc bs=1M of=/dev/null count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.58096 s, 192 MB/s
$ sudo dd if=/dev/sdd bs=1M of=/dev/null count=1k
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.43924 s, 197 MB/s
On modern hardware, a throughput rate of less than 100 MB/s typically indicates a performance issue with the storage device. Correct the hardware problem before continuing with Impala tuning or benchmarking.
So, this was all about Impala Troubleshooting & Performance Tuning. Hope you like our explanation.
Conclusion – Impala Troubleshooting
As a result, we have seen various ways of diagnosing and debugging problems in Impala. Still, if any doubt occurs regarding Impala troubleshooting and performance tuning, feel free to ask in the comment section.