SQL Data Profiling in SAP HANA – Grasp the Complete Process!
In this tutorial we will be talking about SQL data profiling in SAP HANA. We will learn the basics of data profiling and understand what is it? Later, we’ll understand some more technical aspects of it. SQL data profiling is an important task in data preparation before loading the data into the data warehouse.
Don't become Obsolete & get a Pink Slip
Follow DataFlair on Google News & Stay ahead of the game
SAP HANA SQL Data Profiling
SQL data profiling is a process of refining the data and analyzing it to get a better understanding of it. The main purpose behind data profiling is to check data coming from different sources, identify problems in it and fix the issues to maintain data quality. Fixing the problems with data quality includes removing incorrect data, removing incomplete data, and prevent data quality issues. Thus, the data profiling helps to get a metrics from a data set so that the user can get a thorough understanding of the data, it’s structure and quality.
The data profiling task can only be done with the data sources from SQL Server and not with any other third-party data source. Data profiling task in SAP HANA can be done by using stored profiles from SQL Server Integration Services package.
1. Data Profiling Task Setup
To set up the data profiling task, you must execute the package having statistics profiling project (that computes profiles). You have to save the output of the execution in an XML format in a file or a package variable.
You can view the statistics profiles and share the output through Statistics Profile Viewer. You can download the SPV software individually. It shows the statistics profile summary as well as a detailed profile.
2. Data Profiling Configuration
When a data profiling task is configured, you can use wildcard columns, i.e. using “*” sign in place of columns having unfamiliar data. Upon project execution, the data profiling task will identify the characteristics such as data type for the unfamiliar column.
Another option for configuring a data profiling task is by selected short profiles that are Quick Profiles. You have to profile the quick profiles using all default profiles and settings of tables and views.
The data profiling task ideally profiles eight different data profiles. Where five profiles evaluate individual columns, the other three profiles evaluate multiple columns and association between them.
3. Data Profiling Task Output
The output or results of data profiling tasks executed on selected number of data profiles are stored in an XML format and layout in the DataProfile.xsd schema.
You must have a look at SQL Operators in SAP HANA
Benefits of Data Profiling in SAP HANA
Some key benefits of data profiling are:
- It enables the users to analyse data from different data sources.
- The users can understand the source data better and comprehend it thoroughly.
- You can use the information provided by data profiling in ETL process.
- You can refine the data by removing incorrect and incomplete information which enhances the data quality before loading it into data warehouse.
- Data profiling improves SAP HANA’s capability to search data by adding keywords and descriptions.
- Another major advantage of data profiling from a project cost point of view is that you can detect issues in data at a very early stage of the project. This saves a lot of time and cost of the company.
- Also, the implementation cycle of major projects by the important insights that data profiling provides about the data structure and characteristics.
Data Profiling in SAP HANA Studio
Now that we know what is data profiling and understood its significance, we must learn how to use the profiling capabilities in SAP HANA Studio. In SAP HANA Studio, we can conduct profiling for data in tables or data in views.
Follow the steps below to learn the process:
Step 1: Open SAP HANA Studio and login to your HANA system. Make sure you are working on the SAP HANA Modeler perspective.
Then drill down to the specific table or view against which you wish to do data profiling and analysis. You can access data tables under the Catalog node.
Step 2: Right-click on the name of the view or table and select the option Open Data Preview.
In the data preview window, you get three tabs; Raw Data, Distinct values, and Analysis tab.
Must Learn – SQL Data Types in SAP HANA
In the Raw Data tab, you can do several things, like add filters, add conditions for the filers, etc. For instance, we have a table containing customer information and we apply a filter on company name “DataFlair”. We can select from a range of predicates like equal to (=), not equal to, lower (<), etc. The table will show filtered results, i.e. only the data related to DataFlair.
You can save the resultant table and export the data to another system. Click on the save icon given at the top right to save the raw data table. You can save the table in three types of formats; Delimited text files, HTML file, and XML file.
The next tab is the Distinct Values tab where you can perform basic data profiling. On this window, you will get a list of available objects from your selected data source. You can select one object from the list, and it will show you all the details of it.
For instance, you can select an object COUNTRYNAME from the Available Objects section. It will show all the distinct values present in that object. The data is also represented in a bar chart representation and a statistics box at the bottom.
It also shows the count and distribution percentage for the occurrence of each value. This helps in detailed analysis of the data from the source.
Also, you can select an item and view its data type, native dataytype, java type, column display size, column precision, column scale and row count in the statistics box.
You can get each data column available in the list profiled instantly by clicking on its name.
Lastly, after data profiling, you can conduct visual analysis on the data from the Analysis tab. You can create visualizations by adding columns and measures into the label axis. Also, you can select from the different types of charts like bar chart, horizontal bar chart, pie chart, line chart, heat chart, etc.
There are several other types of visualizations such as tag cloud, box plot, radar, multiple radar chart, etc.
Also, you can view the data in four different forms i.e. as charts, tables, grids, and HTML tables.
Some additional features include being able to add the visualizations into favorites using the star icon on the top right bar. And, adding filters during visual analysis.
This concludes our tutorial on SQL data profiling in SAP HANA. Here, we learned how to organize data to optimize the performance of SAP HANA.
In case of any suggestions and queries, please drop your comments in the comment section below.
Dive into the sea of career opportunities in SAP HANA