Data Mining Terminologies and Predictive Analytics Terms
In this Data Mining tutorial, we will study data mining terminology. We will cover the key terms from each area of data mining, along with some predictive analytics terms used in the field.
So, let’s start with the data mining terminology.
Data Mining Terminologies
Let’s begin with the data mining terms:
i. Data Mining
We use data mining to extract information from a huge set of data. Also, we can use this information for any of the following applications −
- Market Analysis
- Fraud Detection
- Customer Retention
- Production Control
- Science Exploration
ii. Data Mining Engine
The data mining engine is essential to the data mining system. It consists of a set of functional modules that perform the following functions:
- Characterization
- Association and Correlation Analysis
- Classification
- Prediction
- Cluster analysis
- Outlier analysis
- Evolution analysis
iii. Knowledge Base
This is the domain knowledge that we use to guide the search.
Data mining is one step in the wider knowledge discovery process, which consists of the following steps:
- Cleaning of data
- Data Integration
- Selection of data
- Transformation of data
- Data Mining
- Pattern Evaluation
- Knowledge Presentation
iv. User Interface
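The user interface lets users communicate with the data mining system. It allows the user to:
- Interact with the system by specifying a data mining query task.
- Provide information to help focus the search.
- Perform mining based on the intermediate data mining results.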
- Browse database and data warehouse schemas or data structures.
- Evaluate mined patterns.
- Visualize the patterns in different forms.
v. Data Integration
It is a data pre-processing technique. We use it to merge data from multiple heterogeneous data sources into a coherent data store. Because the merged data may contain inconsistencies, integration usually goes together with data cleaning.
vi. Associations
An association algorithm creates rules that describe how often events have occurred together.
vii. Backpropagation
It is a training method that we use to calculate the weights in a neural net from the data.
viii. Binning
It is a data preparation activity that converts continuous data to discrete data. To do so, we replace each value from a continuous range with a bin identifier, where each bin represents a range of values.
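As a minimal sketch of binning, the following Python function maps a continuous value to a bin identifier. The function name `bin_index` and the age boundaries are assumptions made up for illustration.

```python
def bin_index(value, edges):
    """Return the index of the bin that contains value.

    edges is a sorted list of bin boundaries. Bin i covers
    edges[i] <= value < edges[i + 1]; the last bin also includes
    its upper edge.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the binning range")
    for i in range(len(edges) - 1):
        if value < edges[i + 1]:
            return i
    return len(edges) - 2  # value equals the top edge


ages = [3, 17, 18, 42, 65, 90]
edges = [0, 18, 65, 120]  # hypothetical bins: minor, adult, senior
bins = [bin_index(a, edges) for a in ages]
print(bins)
```

Each continuous age is replaced by the identifier (0, 1, or 2) of the bin it falls into.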
ix. CART
CART stands for Classification And Regression Trees. This method partitions the independent variables into small groups and fits a constant function to each of the resulting small data sets. In classification trees, the constant function takes on a small finite set of values, such as yes or no. In regression trees, the mean value of the response is fit to small connected data sets.
x. Categorical data
Categorical data fits into a small number of discrete categories. It is either non-ordered (nominal), such as gender or city, or ordered (ordinal), such as high, medium, or low temperatures.
xi. CHAID
CHAID stands for Chi-squared Automatic Interaction Detection. It is an algorithm that we use for fitting categorical trees. It relies on the chi-squared statistic to split the data into small connected data sets.
xii. Chi-squared
Chi-squared is a statistic that assesses how well a model fits the data. In data mining, we use it to find homogeneous subsets for fitting categorical trees, as in CHAID.
xiii. Classification
It refers to the data mining problem of predicting the category of categorical data by building a model based on some predictor variables.
xiv. Classification tree
A decision tree that places categorical variables into classes.
xv. Cleaning (cleansing)
It is a process of preparing data for a data mining activity. Obvious data errors are detected and corrected and missing data is replaced.
xvi. Confusion matrix
We use this matrix to show the count of the actual versus predicted class values. It shows not only how well the model predicts but also presents the details needed to see exactly where things may have gone wrong.
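The following sketch counts actual versus predicted class values using only the Python standard library. The labels and predictions are made-up sample data, and `confusion_matrix` here is our own helper, not a library function.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))


# Hypothetical true classes and model predictions for six cases.
actual = ["spam", "spam", "ham", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]

cm = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# One row per actual class, one column per predicted class.
for a in labels:
    print(a, [cm[(a, p)] for p in labels])

# The diagonal cells are the correct predictions.
accuracy = sum(cm[(label, label)] for label in labels) / len(actual)
print(accuracy)
```

The off-diagonal cells show exactly which classes the model confuses with each other.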
xvii. Consequent
Whenever an association between two variables is defined, the second item is called the consequent.
xviii. Continuous
Continuous data can have any value in an interval of real numbers. That is, the value does not have to be an integer. Continuous is the opposite of discrete or categorical.
xix. Cross-validation
A method of estimating the accuracy of a classification or regression model. The data set is divided into several parts, with each part in turn used to test a model fitted to the remaining parts.
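As a rough sketch, the division of a data set into parts can be expressed as index lists. The helper `k_fold_splits` is a hypothetical name, and for simplicity it does not shuffle the data first, which a real evaluation would usually do.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

A model would be fitted on each `train` list and scored on the matching `test` list, and the scores would then be averaged.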
xx. Data
Data is defined as facts, transactions, and figures.
xxi. DBMS
It refers to database management systems.
xxii. Data format
Data items can exist in many formats such as text, integer, and floating-point decimal. The form of the data in the database is data format.
xxiii. Decision Tree
We use it to represent a collection of hierarchical rules that lead to a class or value.
xxiv. Data Mining method
The procedures and algorithms designed to analyze the data in databases.
xxv. Deduction
Deduction infers information that is a logical consequence of the data.
xxvi. Degree of fit
A measure of how closely the model fits the training data. A common measure is the r-square.
xxvii. Dependent Variable
The dependent variables of a model are the variables that the equation of the model predicts from the independent variables.
xxviii. Deployment
Once the model is trained and validated, we use it to analyze new data and make predictions. This use of the model is called deployment.
xxix. Dimension
Each attribute of a case or occurrence in the data being mined. It is stored as a field in a flat file record or a column of a relational database table.
xxx. Discrete
A data item that has a finite set of values. Discrete is the opposite of continuous.
xxxi. Discriminant analysis
It is a statistical method based on maximum likelihood for determining boundaries that separate the data into categories.
xxxii. Entropy
A way to measure variability other than the variance statistic. Some decision trees split the data into groups based on minimum entropy.
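The sketch below assumes the usual Shannon entropy in bits, which the definition above does not name explicitly. It shows why a tree would prefer a split whose groups have low entropy; the helper names and labels are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(groups):
    """Size-weighted average entropy of the groups produced by a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)


mixed = ["yes", "yes", "no", "no"]           # evenly mixed: maximum entropy
pure_split = [["yes", "yes"], ["no", "no"]]  # each group has a single class

print(entropy(mixed))
print(split_entropy(pure_split))
```

An evenly mixed two-class group has an entropy of one bit, while a split into pure groups has an entropy of zero.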
xxxiii. Exploratory Analysis
Looking at data to discover relationships not previously detected. Exploratory analysis tools typically assist the user in creating tables and graphical displays.
xxxiv. External Data
Data not collected by the organization, such as data available from a reference book or a government source.
xxxv. Feed-forward
A neural net in which the signals only flow in one direction, from the inputs to the outputs.
xxxvi. Fuzzy Logic
Fuzzy logic is applied to fuzzy sets, where membership in a set is a matter of degree: a value between 0 and 1, not necessarily 0 or 1. Non-fuzzy logic manipulates outcomes that are either true or false. Fuzzy logic needs to be able to manipulate degrees of “maybe” in addition to true and false.
xxxvii. Genetic Algorithms
A computer-based method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on natural evolution concepts, such as genetic combination, mutation, and natural selection.
xxxviii. GUI
Graphical User Interface.
xxxix. Independent variable
The independent variables of a model are the variables used in the equation to predict the output (dependent) variable.
xl. Induction
A technique that infers generalizations from the information in the data.
xli. Interaction
Two independent variables interact when changes in the value of one change the effect of the other on the dependent variable.
xlii. Internal data
Data collected by an organization such as operating and customer data.
xliii. k-nearest neighbor
A classification method that classifies a point by calculating the distances between that point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).
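A minimal sketch of this method, assuming Euclidean distance and a simple majority vote, could look as follows. The function name `knn_classify` and the training points are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(point, training, k):
    """Classify point by majority vote among its k nearest training points.

    training is a list of (coordinates, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional training data with classes "A" and "B".
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
]

near_a = knn_classify((1.2, 1.4), training, k=3)
near_b = knn_classify((6.5, 6.0), training, k=3)
print(near_a, near_b)
```

Choosing an odd k for a two-class problem avoids tied votes.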
xliv. Kohonen Feature Map
A type of neural network that uses unsupervised learning to find patterns in data. In data mining, it is employed for cluster analysis.
xlv. Layer
Nodes in a neural net are usually grouped into layers, with each layer described as input, output, or hidden. There are as many input nodes as there are input variables and as many output nodes as there are output variables. Typically, there are one or two hidden layers.
xlvi. Leaf
A node that is not split further (the terminal grouping) in a classification or decision tree.
xlvii. Learning
Training models (estimating their parameters) based on existing data.
xlviii. Least Squares
It is the most common method of training (estimating) the weights of a model. We choose the weights that minimize the sum of the squared deviations of the model’s predicted values from the observed values of the data.
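For a straight-line model with one input, the weights that minimize the sum of squared deviations have a well-known closed form. The sketch below uses it. The function name `fit_line` and the sample data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the model y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """The quantity that least squares minimizes."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # lies exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept, sum_squared_error(xs, ys, slope, intercept))
```

Any other choice of slope or intercept gives a larger sum of squared errors on the same data.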
xlix. MARS
Multivariate Adaptive Regression Splines. MARS is a generalization of a decision tree.
l. Maximum likelihood
Another training or estimation method. The maximum likelihood estimate of a parameter is the value of the parameter that maximizes the probability that the data came from the population defined by the parameter.
li. Mean
The arithmetic average value of a collection of numeric data.
lii. Median
The value in the middle of a collection of ordered data. In other words, the value with the same number of items above and below it.
liii. Missing data
Data values can be missing because they were not measured, not answered, were unknown or were lost. Data mining methods vary in the way they treat missing values.
liv. Mode
The most common value in a data set. If more than one value shares the highest number of occurrences, the data is multi-modal.
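The mean, median, and mode can be computed with Python’s standard `statistics` module. The data below is made up, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

data = [1, 2, 2, 3, 3, 10]

average = mean(data)     # arithmetic average
middle = median(data)    # average of the two middle values for an even count
modes = multimode(data)  # every value that shares the highest count

print(average, middle, modes)
```

Note how the single large value pulls the mean above the median, and how two values tie for the most common value.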
lv. Node
A decision point in a classification tree. Also, a point in a neural net that combines input from other nodes and produces an output through the application of an activation function.
lvi. Noise
The difference between a model and its predictions. Data is also sometimes called noisy when it contains errors, such as many missing or incorrect values, or when there are extraneous columns.
lvii. Non-applicable Data
Missing values that would be logically impossible or that are obviously not relevant.
lviii. Normalize
A collection of numeric data is normalized by subtracting the minimum value from all values and then dividing by the range of the data. This yields data with a similarly shaped histogram but with all values between 0 and 1. It is useful to do this for all inputs into neural nets and also for inputs into other regression models.
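This min-max normalization can be sketched in a few lines of Python. The function name and the sample values are illustrative.

```python
def min_max_normalize(values):
    """Rescale values to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize data whose range is zero")
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([10, 20, 15, 30])
print(normalized)
```

The smallest value maps to 0, the largest to 1, and the relative spacing of the values is preserved.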
lix. OLAP
On-Line Analytical Processing tools give the user the capability to perform multi-dimensional analysis of the data.
lx. Optimization Criterion
A positive function of the difference between predictions and data. Estimates are chosen so as to optimize this function or criterion. Least squares and maximum likelihood are examples.
lxi. Outliers
Generally, outliers are data items that did not come from the assumed population of data.
lxii. Overfitting
The tendency of some modeling techniques to assign importance to random variations in the data by declaring them important patterns.
lxiii. Overlay
Data not collected by the organization, such as data from a proprietary database, that is combined with the organization’s own data.
lxiv. Parallel processing
Several computers or CPUs linked together so that each can be computing simultaneously.
lxv. Prevalence
The measure of how often the collection of items in an association occur together, expressed as a percentage of all the transactions.
lxvi. Pruning
Eliminating lower-level splits in a decision tree. We also use this term to describe algorithms that adjust the topology of a neural net by removing (i.e., pruning) hidden nodes.
lxvii. Range
The range of the data is the difference between the maximum value and the minimum value. Alternatively, a range can include the minimum and maximum, as in “The value ranges from 2 to 8.”
lxviii. RDBMS
Relational Database Management System.
lxix. Regression Tree
A decision tree that predicts values of continuous variables.
lxx. Resubstitution Error
The estimate of error based on the differences between the predicted values and the observed values in the training set.
lxxi. Right-hand side
Whenever an association between two variables is defined, the second item is the right-hand side. It is another name for the consequent.
lxxii. R-squared
A number between 0 and 1 that measures how well a model fits its training data. One is a perfect fit, while zero implies the model has no predictive ability. We compute it as the square of the correlation between the predicted and observed values, where the correlation is their covariance divided by the product of their standard deviations.
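As a hedged sketch, the correlation-based computation can be written as follows: the covariance of the observed and predicted values is divided by the product of their standard deviations, and the resulting correlation is squared to give R-squared. The function name and data are illustrative, and other definitions of R-squared exist.

```python
import math


def r_squared(observed, predicted):
    """Squared correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    correlation = cov / (sd_o * sd_p)
    return correlation ** 2


observed = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]     # predictions equal to the observations
unrelated = [1.0, -1.0, -1.0, 1.0]  # no linear relation to the observations

print(r_squared(observed, perfect))
print(r_squared(observed, unrelated))
```

Because it is based on correlation, this measure ignores any constant offset or scaling in the predictions, so it is best read alongside an error measure.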
lxxiii. Sampling
Creating a subset of data from the whole. Random sampling attempts to represent the whole by choosing the sample through a random mechanism.
lxxiv. Sensitivity Analysis
Varying the parameters of a model to assess the change in its output.
lxxv. Sequence Discovery
The same as an association, except that the time sequence of events is also considered. For example, “Twenty percent of the people who buy a VCR buy a camcorder within four months.”
lxxvi. SMP
Symmetric multi-processing is a computer configuration where many CPUs share a common operating system, main memory, and disks. They can work on different parts of a problem at the same time.
lxxvii. Supervised Learning
The collection of techniques where analysis uses a well-defined (known) dependent variable. All regression and classification techniques are supervised.
lxxviii. Support
The measure of how often the collection of items in an association occur together, expressed as a percentage of all the transactions. For example, “In 2% of the purchases at the hardware store, both a pick and a shovel were bought.”
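Support can be sketched as a simple count over transactions. The transactions below are hypothetical hardware store purchases, and `support` is our own helper name.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


# Hypothetical purchases, one set of items per transaction.
transactions = [
    {"pick", "shovel"},
    {"shovel"},
    {"pick", "shovel", "gloves"},
    {"gloves"},
    {"hammer"},
]

pick_and_shovel = support(transactions, {"pick", "shovel"})
print(pick_and_shovel)
```

Multiplying the result by 100 gives the percentage form used in the example above.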
lxxix. Test data
A data set independent of the training data set that we use to fine-tune the estimates of the model parameters (i.e., weights).
lxxx. Test Error
The estimate of error based on the difference between the predictions of a model on a test data set and the observed values in the test data set when the test data set was not used to train the model.
lxxxi. Time Series
A series of measurements taken at consecutive points in time.
lxxxii. Time Series Model
It’s a type of model that forecasts future values of a time series based on past values.
lxxxiii. Topology
For a neural net, topology refers to the number of layers and the number of nodes in each layer.
lxxxiv. Training
Another term for estimating a model’s parameters based on the data set at hand.
lxxxv. Training data
A data set used to estimate or train a model.
lxxxvi. Transformation
It is a re-expression of the data, such as aggregating it, normalizing it, or changing its unit of measure.
lxxxvii. Unsupervised Learning
A group of techniques in which groupings of the data are defined without the use of a dependent variable.
lxxxviii. Validation
The process of testing the models with a data set different from the training dataset.
lxxxix. Variance
The most commonly used statistical measure of dispersion. The first step is to square the deviation of each data item from the average value. The average of the squared deviations is then calculated to obtain an overall measure of variability.
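The two steps translate directly into Python. This sketch computes the population variance, which divides by the number of items; the sample variance divides by one less than that. The data is made up.

```python
def variance(values):
    """Population variance: the average squared deviation from the mean."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]
spread = variance(data)
print(spread)
```

The standard library function `statistics.pvariance` performs the same calculation.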
xc. Visualization
Visualization tools graphically display data to facilitate a better understanding of its meaning. Graphical capabilities range from simple scatter plots to complex multi-dimensional representations.
xci. Windowing
Used when training a model with time series data. A window is the period of time for each training case.
For example:
Suppose we have weekly stock price data covering fifty weeks and we set the window to five weeks. The first training case uses weeks one through five and compares its prediction to week six. The second case uses weeks two through six to predict week seven, and so on.
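The windowing scheme in this example can be sketched as follows. Week numbers stand in for the actual stock prices, and `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(series, window):
    """Return (inputs, target) training cases from a time series.

    Each case uses `window` consecutive values as inputs and the
    value that immediately follows them as the target.
    """
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]


weeks = list(range(1, 51))  # week numbers 1..50 as placeholder data
cases = sliding_windows(weeks, window=5)

print(cases[0])
print(cases[1])
print(len(cases))
```

Fifty weeks with a five-week window yield forty-five training cases, because the last five weeks have no later value to predict.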
So, this was all about data mining terminology. We hope you liked our explanation.
Conclusion
In this tutorial, we have studied data mining terminology. These terms will help you understand the concepts related to data mining. If you have any questions, feel free to ask in the comment section.