# Data Mining Terminologies and Predictive Analytics Terms

## 1. Data Mining Terms – Objective

In this **Data Mining Tutorial**, we will study data mining terminology. We will cover the key terms used across data mining, and we will also discuss some predictive analytics terms.

So, let’s start with the data mining terminology.

## 2. Data Mining Terminologies

Let’s begin with the data mining terminology:

**i. Data Mining**

We use data mining to extract useful information from a huge set of data. We can use this information in any of the following applications:

- Market Analysis

- Fraud Detection

- Customer Retention

- Production Control

- Science Exploration

**ii. Data Mining Engine**

It is essential to the data mining system. It consists of a set of functional modules that perform the following functions:

- Characterization

- Association and Correlation Analysis

- Classification

- Prediction

- Cluster analysis

- Outlier analysis

- Evolution analysis

**iii. Knowledge Base**

This is the domain knowledge. We use it to guide the search.

The knowledge discovery process involves the following steps:

- Cleaning of data

- Data Integration

- Selection of data

- Transformation of data

- Data Mining

- Pattern Evaluation

- Knowledge Presentation

**iv. User Interface**

- Interact with the system by specifying a data mining query task.

- Provide information to help focus the search.

- Mine further based on the intermediate data mining results.

- Browse database and data warehouse schemas or data structures.

- Evaluate mined patterns.

- Visualize the patterns in different forms.

**v. Data Integration**

It is a data pre-processing technique. We use it to merge data from multiple heterogeneous data sources into a coherent data store. The merged data can be inconsistent and therefore needs data cleaning.

**vi. Associations**

An association algorithm creates rules that describe how often events have occurred together.

**vii. Backpropagation**

It is a training method that we use to calculate the weights in a neural net from the data.

**viii. Binning**

It is a data preparation activity that converts continuous data to discrete data. To do so, we replace a value from a continuous range with a bin identifier, where each bin represents a range of values.
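As a minimal sketch of binning, the snippet below replaces each continuous value with a bin identifier. The function name `bin_index`, the sample ages, and the cut points in `edges` are hypothetical and chosen only for illustration.

```python
def bin_index(value, edges):
    """Return the identifier of the bin that `value` falls into.

    `edges` is a sorted list of upper bin boundaries (assumed cut points);
    values at or above the last edge go into a final overflow bin.
    """
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


ages = [3, 17, 25, 42, 67, 80]
edges = [18, 40, 65]  # bins: below 18, 18-39, 40-64, 65 and above
binned = [bin_index(a, edges) for a in ages]
print(binned)
```

Each age is now represented by a small integer, which is the form many categorical methods expect.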

**ix. CART**

CART stands for Classification And Regression Trees. In this method, we grow decision trees by splitting the independent variables into small groups and fitting a constant function to the small data sets. In categorical trees, the constant function takes on a finite, small set of values. In regression trees, the mean value of the response is fit to small connected data sets.

**x. Categorical data**

Categorical data fits into a small number of discrete categories. It is either non-ordered (nominal), such as gender or city, or ordered (ordinal), such as high, medium, or low temperatures.

**xi. CHAID**

It is an algorithm that we use for fitting categorical trees. It relies on the chi-squared statistic to split the data into small connected data sets.

**xii. Chi-squared**

Chi-squared is a statistic that assesses how well a model fits the data. In data mining, we use it to find homogeneous subsets for fitting categorical trees, as in CHAID.
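The entry above does not give a formula. The sketch below assumes Pearson's chi-squared statistic, which sums the squared differences between observed and expected counts, each divided by the expected count. The counts used here are made up for illustration.

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic for paired observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [18, 22, 20]  # hypothetical counts per category
expected = [20, 20, 20]  # counts the model would predict
stat = chi_squared(observed, expected)
print(stat)
```

A value near zero means the observed counts are close to what the model expects; larger values indicate a poorer fit.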

**xiii. Classification**

It refers to the data mining problem of predicting the category of categorical data by building a model based on some predictor variables.

**xiv. Classification tree**

A decision tree that places categorical variables into classes.

**xv. Cleaning (cleansing)**

It is a process of preparing data for a data mining activity. Obvious data errors are detected and corrected and missing data is replaced.

**xvi. Confusion matrix**

This matrix shows the counts of the actual versus predicted class values. It shows not only how well the model predicts but also presents the details needed to see exactly where things may have gone wrong.
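A small sketch of how such a matrix can be tallied: each (actual, predicted) pair is counted, and the counts are printed as one row per actual class. The labels and predictions are invented sample data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]

cm = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))
for a in labels:
    # One row per actual class, one column per predicted class.
    print(a, [cm[(a, p)] for p in labels])
```

The diagonal cells hold the correct predictions; the off-diagonal cells show which classes the model confuses with each other.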

**xvii. Consequent**

Whenever an association between two variables is defined, the second item is called the consequent.

**xviii. Continuous**

Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete or categorical.

**xix. Cross-validation**

A method of estimating the accuracy of a classification or regression model. The data set is divided into several parts, with each part in turn used to test a model fitted to the remaining parts.
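The splitting described above can be sketched as follows. The helper `k_fold_indices` is a hypothetical name; it only produces the index splits and leaves model fitting and scoring to the caller.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

In practice the data is usually shuffled before splitting; that step is omitted here to keep the sketch short.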

**xx. Data**

Data is defined as facts, transactions, and figures.

**xxi. DBMS**

It refers to database management systems.

**xxii. Data format**

Data items can exist in many formats such as text, integer, and floating-point decimal. The form of the data in the database is data format.

**xxiii. Decision Tree**

We use it to represent a collection of hierarchical rules that lead to a class or value.

**xxiv. Data Mining method**

The procedures and algorithms designed to analyze the data in databases.

**xxv. Deduction**

Deduction infers information that is a logical consequence of the data.

**xxvi. Degree of fit**

A measure of how closely the model fits the training data. A common measure is R-squared.

**xxvii. Dependent Variable**

The variables of the model that the equation of the model predicts using the independent variables.

**xxviii. Deployment**

Once the model is trained and validated, we use it to analyze new data and make predictions. This use of the model is called deployment.

**xxix. Dimension**

Each attribute of a case or occurrence in the data being mined. It is stored as a field in a flat file record or a column of a relational database table.

**xxx. Discrete**

A data item that has a finite set of values. Discrete is the opposite of continuous.

**xxxi. Discriminant analysis**

It is a statistical method based on maximum likelihood for determining the boundaries that separate the data into categories.

**xxxii. Entropy**

A way to measure variability other than the variance statistic. Some decision trees split the data into groups based on minimum entropy.
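The entry does not say which entropy measure is meant. The sketch below assumes Shannon entropy computed over class labels, which is the form commonly used when decision trees choose a split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels (assumed measure)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


mixed = entropy(["yes", "yes", "no", "no"])   # evenly mixed group
pure = entropy(["yes", "yes", "yes", "yes"])  # group with a single class
print(mixed, pure)
```

A pure group has zero entropy, so a split that produces purer groups lowers the entropy of the resulting data sets.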

**xxxiii. Exploratory Analysis**

Looking at data to discover relationships not previously detected. Exploratory analysis tools typically assist the user in creating tables and graphical displays.

**xxxiv. External Data**

Data not collected by the organization, such as data available from a reference book or a government source.

**xxxv. Feed-forward**

A neural net in which the signals only flow in one direction, from the inputs to the outputs.

**xxxvi. Fuzzy Logic**

Fuzzy logic is applied to fuzzy sets where membership in a fuzzy set is a probability, not necessarily 0 or 1. Non-fuzzy logic manipulates outcomes that are either true or false. Fuzzy logic needs to be able to manipulate degrees of “maybe” in addition to true and false.

**xxxvii. Genetic Algorithms**

A computer-based method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on natural evolution concepts, such as genetic combination, mutation, and natural selection.

**xxxviii. GUI**

Graphical User Interface.

**xxxix. Independent variable**

The variables of a model that are used in the equation to predict the output variable.

**xl. Induction**

A technique that infers generalizations from the information in the data.

**xli. Interaction**

Two independent variables interact when changes in the value of one change the effect of the other on the dependent variable.

**xlii. Internal data**

Data collected by an organization such as operating and customer data.

**xliii. k-nearest neighbor**

A classification method that classifies a point by calculating the distances between that point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).
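A minimal sketch of this method, assuming Euclidean distance and a tiny made-up training set of two-dimensional points. The function name `knn_classify` and the data are illustrative only.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest training points.

    `training` is a list of (coordinates, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
print(knn_classify((2, 2), training))
print(knn_classify((8.5, 8.5), training))
```

Choosing an odd k for two-class problems avoids tied votes; other distance measures can be substituted for `math.dist`.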

**xliv. Kohonen Feature Map**

A type of neural network that uses unsupervised learning to find patterns in data. In data mining, it is employed for cluster analysis.

**xlv. Layer**

Nodes in a neural net are usually grouped into layers, with each layer described as input, output, or hidden. There are as many input nodes as there are input variables and as many output nodes as there are output variables. Typically, there are one or two hidden layers.

**xlvi. Leaf**

A node that is not split further (the terminal grouping) in a classification or decision tree.

**xlvii. Learning**

Training models (estimating their parameters) based on existing data.

**xlviii. Least Squares**

It is the most common method of training (estimating) the weights of a model. We choose the weights that minimize the sum of the squared deviations of the model’s predicted values from the observed values of the data.
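As an illustration, the sketch below fits a straight line to a few made-up points using the closed-form least squares solution for one input variable. The function name `fit_line` and the data are assumptions for the example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # lies exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Here the two weights are the slope and the intercept; models with more inputs follow the same principle with more weights.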

**xlix. MARS**

Multivariate Adaptive Regression Splines. MARS is a generalization of a decision tree.

**l. Maximum likelihood**

Another training or estimation method. The maximum likelihood estimate of a parameter is the value of the parameter that maximizes the probability that the data came from the population defined by the parameter.
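A small sketch of the idea, assuming a simple coin-flip (Bernoulli) model with 7 successes in 10 trials. It searches a grid of candidate values for the probability parameter and keeps the one with the highest log-likelihood. The numbers and names are illustrative.

```python
import math


def log_likelihood(p, successes, trials):
    """Log-probability of the observed successes under parameter p."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)


successes, trials = 7, 10
candidates = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
best_p = max(candidates, key=lambda p: log_likelihood(p, successes, trials))
print(best_p)
```

For this model the estimate matches the observed proportion of successes, which is the known closed-form answer; the grid search is only there to make the maximization visible.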

**li. Mean**

The arithmetic average value of a collection of numeric data.

**lii. Median**

The value in the middle of a collection of ordered data. In other words, the value with the same number of items above and below it.

**liii. Missing data**

Data values can be missing because they were not measured, not answered, were unknown or were lost. Data mining methods vary in the way they treat missing values.

**liv. Mode**

The most common value in a data set. If more than one value shares the highest count, the data is multimodal.
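The mean, median, and mode entries can be illustrated with Python's standard `statistics` module. The sample values are made up and deliberately contain two equally common values.

```python
import statistics

values = [2, 3, 3, 5, 7, 8, 8]

mean = statistics.mean(values)        # arithmetic average
median = statistics.median(values)    # middle value of the ordered data
modes = statistics.multimode(values)  # every value tied for most common

print(mean, median, modes)
```

Because both 3 and 8 occur twice, `multimode` returns two values, which is the multimodal case described above.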

**lv. Node**

A decision point in a classification tree. Also, a point in a neural net that combines input from other nodes and produces an output through the application of an activation function.

**lvi. Noise**

The difference between the observed data and a model’s predictions. Data is also sometimes called noisy when it contains errors, such as many missing or incorrect values, or when there are extraneous columns.

**lvii. Non-applicable Data**

Missing values that would be logically impossible or are obviously not relevant.

**lviii. Normalize**

A collection of numeric data is normalized by subtracting the minimum value from all values and then dividing by the range of the data. This yields data with a similarly shaped histogram but with all values between 0 and 1. It is useful to do this for all inputs into neural nets and also for inputs into other regression models.
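The calculation described above can be sketched as follows. The function name `min_max_normalize` and the sample values are hypothetical, and the handling of constant data (all values equal) is an added assumption.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] by subtracting the minimum and dividing by the range."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # Assumption: map constant data to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


normalized = min_max_normalize([10, 20, 30, 50])
print(normalized)
```

The smallest value maps to 0 and the largest to 1, while the relative spacing of the values is preserved.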

**lix. OLAP**

On-Line Analytical Processing tools give the user the capability to perform multi-dimensional analysis of the data.

**lx. Optimization Criterion**

A positive function of the difference between predictions and data. Estimates are chosen so as to optimize the function or criterion. Least squares and maximum likelihood are examples.

**lxi. Outliers**

Generally, outliers are data items that did not come from the assumed population of data.

**lxii. Overfitting**

A tendency of some modeling techniques to assign importance to random variations in the data by declaring them important patterns.

**lxiii. Overlay**

Data not collected by the organization, such as data from a proprietary database, that is combined with the organization’s own data.

**lxiv. Parallel processing**

Several computers or CPUs linked together so that each can be computing simultaneously.

**lxv. Prevalence**

The measure of how often the collection of items in an association occur together, expressed as a percentage of all the transactions.

**lxvi. Pruning**

Eliminating lower-level splits in a decision tree. We also use this term to describe algorithms that adjust the topology of a neural net by removing (i.e., pruning) hidden nodes.

**lxvii. Range**

The range of the data is the difference between the maximum value and the minimum value. Alternatively, a range can include the minimum and maximum, as in “The value ranges from 2 to 8.”

**lxviii. RDBMS**

Relational Database Management System.

**lxix. Regression Tree**

A decision tree that predicts values of continuous variables.

**lxx. Resubstitution Error**

The estimate of error based on the differences between the predicted values and the observed values in the training set.

**lxxi. Right-hand side**

Whenever we need to define an association between two variables, the second item is the right-hand side.

**lxxii. R-squared**

A number between 0 and 1 that measures how well a model fits its training data. One is a perfect fit; zero implies the model has no predictive ability. We compute it as the square of the correlation between the predicted and observed values, where the correlation is their covariance divided by the product of their standard deviations.
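A minimal sketch of this calculation, taking R-squared as the squared correlation between predicted and observed values (the covariance divided by the product of the two standard deviations, then squared). The function name and sample data are illustrative.

```python
import math


def r_squared(observed, predicted):
    """Squared correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    r = cov / (sd_o * sd_p)
    return r * r


observed = [1, 2, 3, 4]
good_fit = r_squared(observed, [2, 4, 6, 8])   # predictions perfectly correlated
rough_fit = r_squared(observed, [1, 3, 2, 4])  # predictions partly out of order
print(good_fit, rough_fit)
```

Squaring the correlation is what keeps the result between 0 and 1, since the correlation itself can be negative.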

**lxxiii. Sampling**

Creating a subset of data from the whole. Random sampling attempts to represent the whole by choosing the sample through a random mechanism.

**lxxiv. Sensitivity Analysis**

Varying the parameters of a model to assess the change in its output.

**lxxv. Sequence Discovery**

The same as an association, except that the time sequence of events is also considered. For example, “Twenty percent of the people who buy a VCR buy a camcorder within four months.”

**lxxvi. SMP**

Symmetric multi-processing is a computer configuration where many CPUs share a common operating system, main memory, and disks. They can work on different parts of a problem at the same time.

**lxxvii. Supervised Learning**

The collection of techniques where analysis uses a well-defined (known) dependent variable. All regression and classification techniques are supervised.

**lxxviii. Support**

The measure of how often the collection of items in an association occur together, expressed as a percentage of all the transactions. For example, “In 2% of the purchases at the hardware store, both a pick and a shovel were bought.”
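As a sketch, support can be computed by counting the transactions that contain every item in the set and dividing by the total number of transactions. The four transactions below are invented, so the resulting figure differs from the 2% in the example above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


transactions = [
    {"pick", "shovel"},
    {"pick"},
    {"shovel", "gloves"},
    {"pick", "shovel", "gloves"},
]
both = support(transactions, {"pick", "shovel"})
print(both)
```

Multiplying the result by 100 gives the percentage form used in the definition.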

**lxxix. Test data**

A data set independent of the training data set that we use to fine-tune the estimates of the model parameters (i.e., weights).

**lxxx. Test Error**

The estimate of error based on the difference between the predictions of a model on a test data set and the observed values in the test data set when the test data set was not used to train the model.

**lxxxi. Time Series**

A series of measurements taken at consecutive points in time.

**lxxxii. Time Series Model**

It’s a type of model that forecasts future values of a time series based on past values.

**lxxxiii. Topology**

For a neural net, topology refers to the number of layers and the number of nodes in each layer.

**lxxxiv. Training**

Another term for estimating a model’s parameters based on the data set at hand.

**lxxxv. Training data**

A data set used to estimate or train a model.

**lxxxvi. Transformation**

The re-expression of the data, such as aggregating it, normalizing it, or changing its unit of measure.

**lxxxvii. Unsupervised Learning**

A group of techniques in which groupings of the data are defined without the use of a dependent variable.

**lxxxviii. Validation**

The process of testing the models with a data set different from the training dataset.

**lxxxix. Variance**

The most commonly used statistical measure of dispersion. The first step is to square the deviations of each data item from the average value. Then the average of the squared deviations is calculated to obtain an overall measure of variability.
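The two steps above can be sketched directly. This version divides by the number of items (the population variance), matching the "average of the squared deviations" wording; the sample values are made up.

```python
def variance(values):
    """Average of the squared deviations from the mean (population variance)."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / len(squared_deviations)


spread = variance([2, 4, 4, 4, 5, 5, 7, 9])
print(spread)
```

When the data is a sample rather than the whole population, it is common to divide by one less than the number of items instead.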

**xc. Visualization**

Visualization tools graphically display data to facilitate a better understanding of its meaning. Graphical capabilities range from simple scatter plots to complex multi-dimensional representations.

**xci. Windowing**

Used when training a model with time series data. A window is the period of time for each training case.

For example:

Suppose we have weekly stock price data covering fifty weeks, and we set the window to five weeks. The first training case uses weeks one through five and compares its prediction to week six. The second case uses weeks two through six to predict week seven, and so on.
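The fifty-week example can be sketched as follows. Week numbers stand in for the actual prices, and the function name `sliding_windows` is a hypothetical choice for illustration.

```python
def sliding_windows(series, window):
    """Return (inputs, target) training cases from a time series.

    Each case uses `window` consecutive values as inputs and the
    value that immediately follows them as the target.
    """
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]


weeks = list(range(1, 51))  # stand-in for fifty weekly prices
cases = sliding_windows(weeks, 5)
print(cases[0])
print(cases[1])
print(len(cases))
```

With fifty weeks and a five-week window, this produces forty-five training cases, the last of which uses weeks forty-five through forty-nine to predict week fifty.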

So, this was all about data mining terminology. We hope you like our explanation.

## 3. Conclusion

In this tutorial, we studied data mining terminology. These terms will help you understand the concepts related to data mining. If you have any queries, feel free to ask in the comment section.
