- 1. Objective
- 2. Introduction to Various Data Analytics Tools
- 3. The Evolution of Analytic Approaches
- 4. Ensemble method of data analytics
- 5. Commodity Modeling
- 6. Text analytics method of data analysis
- 7. The Evolution of Analytical Tools
- 8. Categories of Analytical Tools
- 9. Some Popular Data Analytical Tools and Techniques
- 10. Difference between R vs SAS vs SPSS
1. Objective
In this data analytics tools tutorial, we will trace the evolution of analytic approaches and look at the main categories of Big Data analytics tools. We then cover the features and importance of R, IBM SPSS, and SAS, and close with a comparison of R, SAS, and SPSS to clarify which tool suits which situation.
2. Introduction to Various Data Analytics Tools
Analytic professionals have used a range of tools over the years to prepare data for analysis, execute analytic algorithms, and assess the results. Over time, the depth and functionality of these tools have increased. In addition to much richer user interfaces, tools now automate or streamline common tasks. As a result, analytic professionals end up with more time to focus on analysis. Combining new tools and methods with improved scalability and processes will help organizations tame Big Data.
3. The Evolution of Analytic Approaches
Let us now look at how analytic approaches have evolved.
Many common analytical and modeling approaches have been in use for years. Some approaches, such as linear regression or decision trees, are effective and relevant, but relatively simple to implement. In earlier times, given tight limits on both tool availability and scalability, simplicity was necessary. Today, however, much more is possible.
Modern technology has led to increased volumes of data. To tackle these volumes, more advanced analysis techniques have been developed, which allow large datasets to be analyzed more accurately.
As a result, earlier approaches such as linear regression and decision trees are now supplemented by more advanced analytical methods, driven by the growth in data volume and the need for more advanced analysis.
A few analytic methods are:
a. Ensemble Methods
The power of ensemble models stems from different techniques presenting different strengths and weaknesses.
b. Commodity Modeling
It aims to improve over where you would end up without any model at all. A commodity modeling process stops when something good enough is found.
c. Text Data Analysis
One of the most rapidly growing methods utilized by organizations today is the analysis of text and other unstructured data sources.
Let us learn the above three methods in detail below.
4. Ensemble method of data analytics
Ensemble approaches are fairly straightforward conceptually. Instead of building a single model with a single technique, multiple models are built using multiple data analytics techniques. The results obtained from all the models are then combined to come up with a final answer.
The process of combining the various results can be anything from a simple average of each model’s predictions to a much more complex formula. It is important to note that ensemble models go beyond picking the best individual performer from a set of models.
The power of ensemble models stems from different techniques presenting different strengths and weaknesses. Certain types of customers, for example, may be scored poorly by one technique but very well by another. By combining intelligence from multiple models, a scoring algorithm becomes better in aggregate, if not literally for every individual customer, product, or store location.
One reason ensemble models are gaining traction is that the theory behind them is easy to understand. When many people each make a prediction, the average of their answers is often very close to the correct one. This phenomenon is often called the Wisdom of Crowds. In the same way, results obtained using ensemble methods tend to be more accurate and more stable than those of a single model, although they are not guaranteed to be risk free.
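The simple-average form of an ensemble can be sketched in a few lines of Python. This is only an illustration: the model names and all of the numbers are made up, and a real ensemble would combine the outputs of fitted models rather than hard-coded lists.

```python
from statistics import mean

# Hypothetical true values and the predictions of three different models.
actual = [10.0, 20.0, 30.0, 40.0]
predictions = {
    "regression": [12.0, 18.0, 33.0, 37.0],
    "decision_tree": [9.0, 23.0, 28.0, 44.0],
    "nearest_neighbor": [8.0, 21.0, 31.0, 41.0],
}


def mean_squared_error(predicted, target):
    """Average squared difference between predictions and true values."""
    return mean((p - t) ** 2 for p, t in zip(predicted, target))


# Combine the models with a simple average of their predictions per record.
ensemble = [mean(values) for values in zip(*predictions.values())]

individual_errors = {
    name: mean_squared_error(values, actual)
    for name, values in predictions.items()
}
ensemble_error = mean_squared_error(ensemble, actual)

for name, error in individual_errors.items():
    print(f"{name}: {error:.3f}")
print(f"ensemble: {ensemble_error:.3f}")
```

In this made-up data the models err in different directions on different records, so the averaged prediction has a lower squared error than each individual model. That is the effect described above; it is not guaranteed for every dataset.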
5. Commodity Modeling
Now let us look at commodity modeling.
A commodity model is not meant to be the best possible model, but one that still leads to better results. It aims to improve over where you would end up without any model at all. That is a lower bar to cross than most models have historically attempted to clear. A commodity modeling process stops when something good enough is found. Such a process makes a lot of sense for low-value problems or situations where too many models are required to pragmatically make each the best it can be.
Commodity models might be built via a simple stepwise analysis procedure, mostly on an automated basis. They enable the application of advanced analytics to a much wider scope of problems and at a larger scale within an organization than is possible when analytic professionals build each model manually.
In evaluating a commodity model, the primary concern is that some benefit is achieved by using it. There may be much room for improvement if more effort were put in. But if a quick model can help in a situation that otherwise would not have a model, it is worth using.
Let us explore an analogy. If you own a home, there are some improvements where you put in only the best. Renovating a visible room like the kitchen is one area that often warrants a top-notch job. For some other improvements, you just get the job done. Perhaps when remodeling the guest bathroom, you are willing to settle for mediocre materials and fixtures. The guest bathroom just is not worth a huge investment. Commodity models help in similar situations for a business and have a wide range of uses.
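The "stop when something good enough is found" idea can be sketched as an automated loop. The following Python example is a minimal illustration under stated assumptions: the feature names, the data, the `commodity_model` helper, and the 50% improvement threshold are all hypothetical, and each candidate is a simple one-variable least-squares line.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        return 0.0, y_bar
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar


def mse(predicted, target):
    return mean((p - t) ** 2 for p, t in zip(predicted, target))


def commodity_model(features, target, min_improvement=0.5):
    """Return the first one-variable model that beats the no-model baseline
    (always predicting the mean) by at least min_improvement, then stop."""
    baseline_error = mse([mean(target)] * len(target), target)
    for name, values in features.items():
        slope, intercept = fit_line(values, target)
        error = mse([slope * v + intercept for v in values], target)
        if error <= baseline_error * (1 - min_improvement):
            return name, slope, intercept  # good enough: stop searching
    return None  # nothing cleared the bar


# Hypothetical store data.
features = {
    "store_age": [5, 1, 4, 2, 3],
    "ad_spend": [1, 2, 3, 4, 5],
    "floor_area": [2, 4, 6, 8, 10],
}
sales = [12, 19, 32, 41, 48]

chosen = commodity_model(features, sales)
print(chosen)
```

Note that `floor_area` is never evaluated: once `ad_spend` clears the threshold, the loop returns. The procedure does not look for the best model, only for one that is clearly better than having no model.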
6. Text analytics method of data analysis
Let us now look at the text analysis method of data analytics.
Text analysis involves the analysis of text and other unstructured data sources. The sources can be varied, ranging from books and e-mails to voice recordings of users.
Almost all organizations today are keen to understand the customer's voice. Sources such as e-mails to the company, customer satisfaction surveys, call center notes, and other documents hold a lot of information about customer concerns and sentiments. Text analytics can be used to identify and address the causes of customer dissatisfaction in a timely manner. It can also help improve brand image by proactively solving problems before they become a sticking point with customers.
Text analysis also helps in fraud detection. Popular commercial text analysis tools include those offered by Attensity, Clarabridge, SAS, and SPSS.
Typically, unstructured data itself is not analyzed. Rather, unstructured data is processed in a way that applies some sort of structure to it. Very few analytical processes analyze and draw inferences directly from data in an unstructured form.
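One common way of applying structure to text is to turn each document into a row of word counts (a "bag of words"). The sketch below uses only the Python standard library. The comments and the stop-word list are invented for illustration, and commercial text analysis tools do considerably more than this.

```python
import re
from collections import Counter

# Hypothetical customer comments (unstructured text).
comments = [
    "The delivery was late and the box was damaged.",
    "Great service, fast delivery!",
    "Late again. Very poor service.",
]

# A tiny, made-up list of words to ignore.
STOP_WORDS = {"the", "and", "was", "a", "very"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# One Counter of word frequencies per comment.
word_counts = [Counter(tokenize(c)) for c in comments]

# The structured form: one column per word, one row per comment.
vocabulary = sorted(set().union(*word_counts))
matrix = [[counts[word] for word in vocabulary] for counts in word_counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Once the text is in this tabular form, ordinary analytic techniques can be applied to it, for example counting how many comments mention "late".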
7. The Evolution of Analytical Tools
Let us see how data analytics tools have evolved.
Analytical tools have progressed from having no user interface at all to offering sophisticated interfaces. Their scope is also no longer confined to the analysis of data alone.
In the late 1980s, analytics was not user friendly, and few tools or systems were available for analysis. All analytics work was done against a mainframe. Not only was there no choice but to write program code directly to do analytics, but it was also necessary to use the dreaded Job Control Language (JCL).
Over time, additional graphical interfaces were developed that enabled users to do a lot through point-and-click environments, rather than coding. Virtually all commercially available analytic tools had such interfaces available by the late 1990s. User interfaces have since improved to include more robust graphics, visual workflow diagrams, and applications focused on specific point solutions.
Since the 2000s, all commercially available analytical tools have had graphical interfaces, and their graphical output has become more sophisticated. There are now tools to manage the deployment of analyses, to manage and administer the analytic servers and software that analytic professionals utilize, and to convert code from one language to another. A number of commercial analytics packages are also available today. Although the market leaders are SAS and SPSS, many other advanced analytics software tools are available. Many are niche tools that address specific areas.
8. Categories of Analytical Tools
There are basically two types of analytical tools:
a. Statistical data analysis Tool
All commercial analytical tools come with graphical user interfaces. As the tools have evolved, the focus has shifted from writing code to using built-in utilities. Packaged point solutions make specific tasks easy to accomplish.
Modern GUIs are robust, largely bug free, and optimized, and they allow analytic processes to be developed at a pace that equals or exceeds hand coding. Real analytic professionals do whatever is best to get a job done as accurately and efficiently as possible. Tools can help analytic professionals be more efficient while freeing up time to focus on analysis methods instead of writing code.
One big risk with user interfaces overlaps with one of their key strengths. It is easy to generate code in a user interface; however, the ability to generate code quickly also makes it easy to generate bad code quickly. A user who is not proficient can accidentally create code through a user interface that does something totally different from what was intended.
b. Data Visualization Tool
The results obtained from the analysis of the data need to be represented in forms that are useful for the user. Visualization tools enable professionals to create interactive, visual analytics. An analytic professional will routinely need to explain complex analytical results to non-technical business people. Anything that helps this to be done more effectively is a good thing, and data visualization falls into this category. Many people would rather see a visual depiction of a decision tree model than a long list of business rules. This is where visualization helps.
9. Some Popular Data Analytical Tools and Techniques
Let us see some top data analysis tools for business:
a. The R Project
R was initially developed by Robert Gentleman and Ross Ihaka and is a descendant of the original “S”, which was an early language for statistical analysis.
R is a free, open-source analytics package that competes directly with, as well as complements, commercial analytic tools.
Features of R:
- R has stronger object-oriented programming facilities than most statistical computing languages.
- R is easily extensible through functions and extensions and can be linked with common programming platforms like C++ and Java, which makes it possible to embed R within applications.
- Most commercial analytic tools have enabled R to be executed within their toolsets.
- A major advantage of R is its extensibility. Developers can easily write their own software and distribute it in the form of add-on packages. Because of the relative ease of creating these packages, thousands of R packages exist. Many new statistical methods are also published with an R package attached.
R has picked up a lot of steam and is now used by a large number of analytic professionals. This is especially true in academic and research environments. It tends to be used for research and development activities rather than large-scale, critical production analytic processes. Within a corporate environment today, if there is a large team of analytics talent, it is often the case that at least a few members of the team are using R in some way.
Limitations of R
- Scalability: One of the disadvantages of R is its limited scalability. Some improvements have been made, but R still is not able to scale to the level of commercial tools and databases. The base R software runs in memory as opposed to running against files.
- R can only handle datasets up to the size of the memory available on the machine, so working with larger datasets causes problems on most machines. The amount of memory in even a very expensive machine is far less than required for handling enterprise-level datasets, let alone Big Data. Thus, if a large organization wants to tame Big Data, R can be a piece of the solution, but it will not realistically be the only piece of the solution based on where it sits today.
- Programming in R is also a fairly intensive process. Although there are some graphical interfaces that sit on top of R, many users today still primarily write code. R interfaces are less mature than the interfaces of commercial tools.
b. IBM SPSS
The SPSS analytical tool was first introduced in 1968. After IBM acquired the SPSS business in 2009, the product was renamed IBM SPSS Statistics. It has a user-friendly GUI.
The software name stands for Statistical Package for the Social Sciences (SPSS), reflecting the original market, although the software is now popular in other fields as well, including the health sciences and marketing. SPSS does not have all the functionality of R, but its syntax and database format are compatible with R, and it can handle large volumes of data.
The main window of IBM SPSS Statistics, the data editor, looks like a spreadsheet in which you can input data directly.
Features of SPSS:
- SPSS commands are executed line by line to update tables or add results to the output editor window. This window also provides an option for storing the executed syntax along with its execution times.
- SPSS can read from and write to ASCII files, databases and tables of other statistical software. SPSS Statistics can read and write to external relational database tables via ODBC and SQL. It provides data management functions, such as sorting, aggregation, transposition, and table merge.
- IBM SPSS Statistics can send the output directly to a file, instead of the Output Editor window.
- IBM SPSS Statistics is available in several environments, including Windows, Mac OS X, and Unix.
- IBM SPSS Statistics can also display a graphic produced by R in its Output window.
c. SAS
SAS Institute, the maker of the Statistical Analysis System (SAS), was founded in 1976, and the software grew up in the IBM mainframe world to handle large data volumes. Its capacity to handle data increased with the implementation of a parallel architecture in 1996.
SAS is a software suite that can be used to mine, alter, manage, and retrieve data from a variety of sources and to perform statistical analysis on it. It is an information delivery system packaged as a modular, integrated, and hardware-independent computing environment. Because it provides a broad and independent environment for working with organizational databases, data analysts can readily transform datasets into useful information that helps them in decision making.
A SAS program consists of DATA steps, procedure steps, and macros, if required. Several procedures provide a comprehensive range of functions (statistics, graphics, utilities, and such), while the DATA step enables the user to open files (or import databases), read each record in turn, write to another file (or export to a database), merge a number of files, and close the files.
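The record-by-record pattern of a DATA step, followed by a summarizing procedure, can be illustrated with a rough Python analogy. This is not SAS syntax and does not use any SAS API. The column names, the data, and the `tier` rule are hypothetical, and in-memory `StringIO` objects stand in for real files so that the example is self-contained.

```python
import csv
import io
from collections import Counter

# Hypothetical input "file".
source = io.StringIO("customer,amount\nA,120\nB,80\nC,200\n")
target = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(
    target, fieldnames=["customer", "amount", "tier"], lineterminator="\n"
)
writer.writeheader()

# DATA-step analogy: read each record in turn, derive a new field,
# and write the record to another file.
for record in reader:
    record["tier"] = "high" if float(record["amount"]) >= 100 else "low"
    writer.writerow(record)

rows = target.getvalue().splitlines()

# Procedure-step analogy: summarize the dataset that was just written.
tier_counts = Counter(line.split(",")[2] for line in rows[1:])

print(rows)
print(dict(tier_counts))
```

The split mirrors the division of labor described above: one step reshapes the data record by record, and a separate step computes statistics on the result.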
Features of SAS:
- Statistics – SAS offers a variety of statistical software, ranging from traditional analysis to dynamic data visualization approaches. It also helps organizations retain their customers.
- Data and Text Mining – Business organizations collect large amounts of data from various sources. They apply data mining and text mining to this collected data to develop new strategies and make better decisions.
- Data Visualization – SAS data visualization provides a user-friendly interface to the advanced analytic capabilities of SAS. It enhances data analysis through effective data visualization.
- Forecasting – SAS supports a wide range of forecasting and analysis needs for both the short and the long term. Its forecasting tools help analyze and forecast processes when required.
- Optimization – SAS tools provide optimization, project scheduling, and simulation techniques to achieve maximum results while operating within restrictions and limited resources.
SAS is designed to deliver universal data access. It provides a good user interface and increases the functionality of applications. SAS provides a variety of analysis procedures that help users navigate through data, so that the most relevant information in the data can be read clearly and analyzed in turn.
SAS products, commonly known as modules, are used by analysts in many fields, including social and behavioral scientists. These modules allow them to perform various types of functions, such as spreadsheet analysis, data access, statistical analysis, application construction, and management. SAS products are sold separately or in sets. SAS solutions offer a number of techniques and processes for guided decision making.
10. Difference between R vs SAS vs SPSS
Let us compare the three analytical tools discussed above:
a. User Interface
SAS has the most interactive and user-friendly interface, followed by SPSS, which offers a moderately interactive GUI. R is the least interactive of the three, although editors are available that provide GUI support for programming in R. However, for learning and practicing hands-on analytics, R is an excellent tool, as it really helps analysts master the various analytics steps and commands.
b. Decision Making
IBM SPSS Statistics has an advantage over SAS not only in its lower price, but also in the possibility of obtaining AnswerTree for decision trees without having to buy the data mining suite. Anyone wanting to construct decision trees with SAS has to buy Enterprise Miner. For decision trees, IBM SPSS is also more competitive than R, which does not offer many tree algorithms. Most of the R packages only implement CART, and their interface is not as user friendly.
c. Data Management
In data management, SAS has an edge over IBM SPSS and is somewhat better than R. A major drawback of R is that most of its functions have to load all the data into memory before execution, which sets a limit on the volumes that can be handled. However, some packages are beginning to break free of this constraint. One example is the biglm package for linear models.
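To see how a linear model can be fitted without loading all the data at once, consider the following Python sketch. It processes the data one chunk at a time and keeps only a few running sums in memory. This is a simplified illustration of the general idea, not a description of how the biglm package is implemented. The function names and the synthetic data are assumptions, and the running-sum formula shown here can lose numerical precision on real data.

```python
def chunked_linear_fit(chunks):
    """Fit y = intercept + slope * x from an iterable of (xs, ys) chunks.

    Only five running totals are kept, so memory use does not grow
    with the number of rows.
    """
    n = sum_x = sum_y = sum_xx = sum_xy = 0.0
    for xs, ys in chunks:
        n += len(xs)
        sum_x += sum(xs)
        sum_y += sum(ys)
        sum_xx += sum(x * x for x in xs)
        sum_xy += sum(x * y for x, y in zip(xs, ys))
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


def make_chunks(total_rows, chunk_size):
    """Yield synthetic data one chunk at a time, following y = 3 + 2x."""
    for start in range(0, total_rows, chunk_size):
        xs = list(range(start, min(start + chunk_size, total_rows)))
        ys = [3 + 2 * x for x in xs]
        yield xs, ys


slope, intercept = chunked_linear_fit(make_chunks(total_rows=10_000, chunk_size=1_000))
print(slope, intercept)
```

Because `make_chunks` is a generator, only one chunk of 1,000 rows exists in memory at any time. In practice each chunk would be read from a file or a database instead.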
d. Documentation and Support
In terms of documentation, R has elaborate documentation files that are easily available, while documentation for SPSS is less extensive because the tool is less widely used. SAS has comprehensive technical documentation of more than 8000 pages.
Because SAS is more widely used in big enterprises than IBM SPSS Statistics, it has more resources devoted to it, such as forums, user clubs, trainers, websites, macro libraries, and books. The R community, however, is one of the strongest open-source communities. SAS also offers more predefined functions, such as mathematical and financial functions, than IBM SPSS Statistics. These include depreciation, compound interest, cash flow, hyperbolic functions, factorials, combinations and arrangements, and others.