Lead Source - Data Mining
Data Mining Glossary
Our location • 12865
Main Street • Suite 642 •
Garden Grove • California
92840
Email • sales@onlythebestleads.com
- accuracy
- Accuracy is an important factor in assessing the success
of data mining. When applied to data, accuracy refers to the
rate of correct values in the data. When applied to models,
accuracy refers to the degree of fit between the model and
the data. This measures how error-free the model's
predictions are. Since accuracy does not include cost
information, it is possible for a less accurate model to be
more cost-effective. Also see precision.
- activation function
- A function used by a node in a neural net to transform
input data from any domain of values into a finite range of
values. The original idea was to approximate the way neurons
fired, and the activation function took on the value 0 until
the input became large and the value jumped to 1. The
discontinuity of this 0-or-1 function caused mathematical
problems, and sigmoid-shaped functions (e.g., the logistic
function) are now used.
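A minimal sketch of the two kinds of activation function
described above: the original 0-or-1 step and the logistic
function. The function names and the threshold of 0 are
illustrative assumptions, not taken from any product.

```python
import math

def step(x, threshold=0.0):
    """0-or-1 activation: jumps from 0 to 1 at the (assumed) threshold."""
    return 1.0 if x >= threshold else 0.0

def logistic(x):
    """Sigmoid-shaped activation: maps any moderate input into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

# Compare the abrupt jump with the smooth transition.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:5.1f}  step={step(x):.0f}  logistic={logistic(x):.3f}")
```

The logistic function changes smoothly, so it has a
derivative everywhere, which the step function lacks at the
jump.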
- antecedent
- When an association between two variables is defined, the
first item (or left-hand side) is called the antecedent. For
example, in the relationship "When a prospector buys a
pick, he buys a shovel 14% of the time," "buys a
pick" is the antecedent.
- API
- An application program interface. When a software system
features an API, it provides a means by which programs
written outside of the system can interface with the system
to perform additional functions. For example, a data mining
software system may have an API that permits user-written
programs to perform such tasks as extracting data, performing
additional statistical analysis, creating specialized charts,
generating a model, or making a prediction from a model.
- associations
- An association algorithm creates rules that describe how
often events have occurred together. For example, "When
prospectors buy picks, they also buy shovels 14% of the
time." Such relationships are typically expressed with
a confidence value (see confidence).
- backpropagation
- A training method used to calculate the weights in a
neural net from the data.
- bias
- In a neural network, bias refers to the constant terms in
the model. (Note that bias has a different meaning to most
data analysts.) Also see precision.
- binning
- A data preparation activity that converts continuous data
to discrete data by replacing a value from a continuous
range with a bin identifier, where each bin represents a
range of values. For example, age could be converted to bins
such as 20 or under, 21-40, 41-65 and over 65.
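The age example can be sketched as follows. This assumes
ages are whole years; the names UPPER_EDGES, LABELS and
age_bin are made up for illustration.

```python
import bisect

# Inclusive upper edges of the first three bins; the last bin is open-ended.
UPPER_EDGES = [20, 40, 65]
LABELS = ["20 or under", "21-40", "41-65", "over 65"]

def age_bin(age):
    """Replace a whole-year age with the label of the bin it falls in."""
    return LABELS[bisect.bisect_left(UPPER_EDGES, age)]

print([age_bin(a) for a in (18, 20, 21, 40, 65, 70)])
```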
- bootstrapping
- Training data sets are created by re-sampling with
replacement from the original training set, so data records
may occur more than once. In other words, this method treats
a sample as if it were the entire population. Usually, final
estimates are obtained by taking the average of the
estimates from each of the bootstrap test sets.
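A minimal sketch of the idea, using the mean as the quantity
being estimated. The sample values, the number of resamples
and the function name bootstrap_mean are assumptions chosen
for illustration.

```python
import random
import statistics

def bootstrap_mean(data, n_resamples=1000, seed=0):
    """Average the means of many resamples drawn with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Same size as the original sample; records may repeat.
        resample = rng.choices(data, k=len(data))
        estimates.append(statistics.fmean(resample))
    return statistics.fmean(estimates), estimates

sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # invented data
estimate, estimates = bootstrap_mean(sample)
print(estimate, statistics.stdev(estimates))
```

The spread of the individual estimates gives a rough idea of
how variable the estimate is.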
- CART
- Classification And Regression Trees. CART is a method of
splitting the independent variables into small groups and
fitting a constant function to the small data sets. In
categorical trees, the constant function is one that takes
on a finite small set of values (e.g., Y or N, low or medium
or high). In regression trees, the mean value of the
response is fit to small connected data sets.
- categorical data
- Categorical data fits into a small number of discrete
categories (as opposed to continuous). Categorical data is
either non-ordered (nominal) such as gender or city, or
ordered (ordinal) such as high, medium, or low temperatures.
- CHAID
- An algorithm for fitting categorical trees. It relies on
the chi-squared statistic to split the data into small
connected data sets.
- chi-squared
- A statistic that assesses how well a model fits the data.
In data mining, it is most commonly used to find homogeneous
subsets for fitting categorical trees as in CHAID.
- classification
- Refers to the data mining problem of attempting to predict
the category of categorical data by building a model based
on some predictor variables.
- classification tree
- A decision tree that places categorical variables into
classes.
- cleaning (cleansing)
- Refers to a step in preparing data for a data mining
activity. Obvious data errors are detected and corrected
(e.g., improbable dates) and missing data is replaced.
- clustering
- Clustering algorithms find groups of items that are
similar. For example, clustering could be used by an
insurance company to group customers according to income,
age, types of policies purchased and prior claims
experience. It divides a data set so that records with
similar content are in the same group, and groups are as
different as possible from each other. Since the categories
are unspecified, this is sometimes referred to as
unsupervised learning.
- confidence
- Confidence of rule "B given A" is a measure of
how likely it is that B occurs when A has
occurred. It is expressed as a percentage, with 100% meaning
B always occurs if A has occurred. Statisticians refer to
this as the conditional probability of B given A. When used
with association rules, the term confidence is observational
rather than predictive. (Statisticians also use this term in
an unrelated way. There are ways to estimate an interval and
the probability that the interval contains the true value of
a parameter is called the interval confidence. So a 95%
confidence interval for the mean has a probability of .95 of
covering the true value of the mean.)
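For association rules, confidence can be computed directly
from transaction counts. The sketch below uses a small
invented set of transactions; the function names are
illustrative, and support here means the fraction of
transactions containing all the listed items.

```python
# Invented toy transactions, one set of purchased items per transaction.
transactions = [
    {"pick", "shovel"},
    {"pick"},
    {"shovel", "pan"},
    {"pick", "shovel", "pan"},
    {"pan"},
]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Observed relative frequency of the consequent given the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"pick", "shovel"}, transactions))
print(confidence({"pick"}, {"shovel"}, transactions))
```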
- confusion matrix
- A confusion matrix shows the counts of the actual versus
predicted class values. It shows not only how well the model
predicts, but also presents the details needed to see
exactly where things may have gone wrong.
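A minimal sketch that counts actual versus predicted class
values. The Y/N labels are invented example data.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

actual    = ["Y", "Y", "N", "N", "Y", "N"]  # invented data
predicted = ["Y", "N", "N", "Y", "Y", "N"]
cm = confusion_matrix(actual, predicted)

# One row per actual class, one column per predicted class.
labels = sorted(set(actual) | set(predicted))
for a in labels:
    print(a, [cm[(a, p)] for p in labels])
```

The off-diagonal counts show which classes the model
confuses with each other.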
- consequent
- When an association between two variables is defined, the
second item (or right-hand side) is called the consequent.
For example, in the relationship "When a prospector
buys a pick, he buys a shovel 14% of the time,"
"buys a shovel" is the consequent.
- continuous
- Continuous data can have any value in an interval of real
numbers. That is, the value does not have to be an integer.
Continuous is the opposite of discrete or categorical.
- cross validation
- A method of estimating the accuracy of a classification or
regression model. The data set is divided into several
parts, with each part in turn used to test a model fitted to
the remaining parts.
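The splitting step can be sketched as below. Assigning
records to parts by position is an assumption made to keep
the example short; in practice the records are usually
shuffled first. The function name is hypothetical.

```python
def k_fold_indices(n_records, k):
    """Yield (train, test) index lists; each part serves as the test set once."""
    folds = [list(range(i, n_records, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print("test:", test, "train:", train)
```

A model would be fitted on each train list and scored on the
matching test list, and the scores averaged.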
- data
- Values collected through record keeping or by polling,
observing, or measuring, typically organized for analysis or
decision making. More simply, data is facts, transactions
and figures.
- data format
- Data items can exist in many formats such as text, integer
and floating-point decimal. Data format refers to the form
of the data in the database.
- data mining
- An information extraction activity whose goal is to
discover hidden facts contained in databases. Using a
combination of machine learning, statistical analysis,
modeling techniques and database technology, data mining
finds patterns and subtle relationships in data and infers
rules that allow the prediction of future results. Typical
applications include market segmentation, customer
profiling, fraud detection, evaluation of retail promotions,
and credit risk analysis.
- data mining method
- Procedures and algorithms designed to analyze the data in
databases.
- DBMS
- Database management systems.
- decision tree
- A tree-like way of representing a collection of
hierarchical rules that lead to a class or value.
- deduction
- Deduction infers information that is a logical consequence
of the data.
- degree of fit
- A measure of how closely the model fits the training data.
A common measure is r-squared.
- dependent variable
- The dependent variables (outputs or responses) of a model
are the variables predicted by the equation or rules of the
model using the independent variables (inputs or
predictors).
- deployment
- After the model is trained and validated, it is used to
analyze new data and make predictions. This use of the model
is called deployment.
- dimension
- Each attribute of a case or occurrence in the data being
mined. Stored as a field in a flat file record or a column
of a relational database table.
- discrete
- A data item that has a finite set of values. Discrete is
the opposite of continuous.
- discriminant analysis
- A statistical method based on maximum likelihood for
determining boundaries that separate the data into
categories.
- entropy
- A way to measure variability other than the variance
statistic. Some decision trees split the data into groups
based on minimum entropy.
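A sketch of how a split might be scored, assuming Shannon
entropy of the class labels (the entry does not say which
form of entropy a given product uses). The function names
are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the class labels in one group."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(groups):
    """Entropy of a split: group entropies weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * entropy(g) for g in groups)

mixed = ["Y", "Y", "N", "N"]
print(entropy(mixed))                            # evenly mixed group
print(split_entropy([["Y", "Y"], ["N", "N"]]))   # split into pure groups
```

A split that produces purer groups has lower entropy, so a
tree builder using this criterion would prefer it.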
- exploratory analysis
- Looking at data to discover relationships not previously
detected. Exploratory analysis tools typically assist the
user in creating tables and graphical displays.
- external data
- Data not collected by the organization, such as data
available from a reference book, a government source or a
proprietary database.
- feed-forward
- A neural net in which the signals only flow in one
direction, from the inputs to the outputs.
- fuzzy logic
- Fuzzy logic is applied to fuzzy sets where membership in a
fuzzy set is a degree between 0 and 1, not necessarily 0 or 1.
Non-fuzzy logic manipulates outcomes that are either true or
false. Fuzzy logic needs to be able to manipulate degrees of
"maybe" in addition to true and false.
- genetic algorithms
- A computer-based method of generating and testing
combinations of possible input parameters to find the
optimal output. It uses processes based on natural evolution
concepts such as genetic combination, mutation and natural
selection.
- GUI
- Graphical User Interface.
- hidden nodes
- The nodes in the hidden layers in a neural net. Unlike
input and output nodes, the number of hidden nodes is not
predetermined. The accuracy of the resulting model is
affected by the number of hidden nodes. Since the number of
hidden nodes directly affects the number of parameters in
the model, a neural net needs a sufficient number of hidden
nodes to enable it to properly model the underlying
behavior. On the other hand, a net with too many hidden
nodes will overfit the data. Some neural net products
include algorithms that search over a number of alternative
neural nets by varying the number of hidden nodes, in the
end choosing the model that gets the best results without
overfitting.
- independent variable
- The independent variables (inputs or predictors) of a
model are the variables used in the equation or rules of the
model to predict the output (dependent) variable.
- induction
- A technique that infers generalizations from the
information in the data.
- interaction
- Two independent variables interact when changes in the
value of one change the effect on the dependent variable of
the other.
- internal data
- Data collected by an organization such as operating and
customer data.
- k-nearest neighbor
- A classification method that classifies a point by
calculating the distances between the point and points in
the training data set. Then it assigns the point to the
class that is most common among its k-nearest neighbors
(where k is an integer).
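A minimal sketch of the method, assuming Euclidean distance
(the entry does not fix a distance measure). The training
points and the function name are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """Assign `point` to the most common class among its k nearest neighbors.

    `training` is a list of (features, label) pairs.
    """
    nearest = sorted(training, key=lambda rec: math.dist(point, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [  # invented data: two well-separated classes
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
print(knn_classify((1.1, 1.0), training))
print(knn_classify((5.0, 4.8), training))
```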
- Kohonen feature map
- A type of neural network that uses unsupervised learning
to find patterns in data. In data mining it is employed for
cluster analysis.
- layer
- Nodes in a neural net are usually grouped into layers,
with each layer described as input, output or hidden. There
are as many input nodes as there are input (independent)
variables and as many output nodes as there are output
(dependent) variables. Typically, there are one or two
hidden layers.
- leaf
- A node not further split -- the terminal grouping -- in a
classification or decision tree.
- learning
- Training models (estimating their parameters) based on
existing data.
- least squares
- The most common method of training (estimating) the
weights (parameters) of a model by choosing the weights that
minimize the sum of the squared deviation of the predicted
values of the model from the observed values of the data.
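For a straight-line model the least squares weights have a
closed form. The sketch below assumes a single input
variable; the data and function names are invented.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared deviations of the predictions from the observations."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]   # invented data
ys = [2.1, 3.9, 6.2, 7.8]
intercept, slope = fit_line(xs, ys)
print(intercept, slope, sum_squared_error(xs, ys, intercept, slope))
```

Any other choice of intercept and slope gives a larger sum
of squared deviations on the same data.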
- left-hand side
- When an association between two variables is defined, the
first item is called the left-hand side (or antecedent). For
example, in the relationship "When a prospector buys a
pick, he buys a shovel 14% of the time," "buys a
pick" is the left-hand side.
- logistic regression (logistic discriminant analysis)
- A generalization of linear regression. It is used for
predicting a binary variable (with values such as yes/no or
0/1). An example of its use is modeling the odds that a
borrower will default on a loan based on the borrower's
income, debt and age.
- MARS
- Multivariate Adaptive Regression Splines. MARS is a
generalization of a decision tree.
- maximum likelihood
- Another training or estimation method. The maximum
likelihood estimate of a parameter is the value of a
parameter that maximizes the probability that the data came
from the population defined by the parameter.
- mean
- The arithmetic average value of a collection of numeric
data.
- median
- The value in the middle of a collection of ordered data.
In other words, the value with the same number of items
above and below it.
- missing data
- Data values can be missing because they were not measured,
not answered, were unknown or were lost. Data mining methods
vary in the way they treat missing values. Typically, they
ignore the missing values, or omit any records containing
missing values, or replace missing values with the mode or
mean, or infer missing values from existing values.
- mode
- The most common value in a data set. If more than one
value occurs the same number of times, the data is
multi-modal.
- model
- An important function of data mining is the production of
a model. A model can be descriptive or predictive. A
descriptive model helps in understanding underlying
processes or behavior. For example, an association model
describes consumer behavior. A predictive model is an
equation or set of rules that makes it possible to predict
an unseen or unmeasured value (the dependent variable or
output) from other, known values (independent variables or
input). The form of the equation or rules is suggested by
mining data collected from the process under study. Some
training or estimation technique is used to estimate the
parameters of the equation or rules.
- MPP
- Massively parallel processing, a computer configuration
that is able to use hundreds or thousands of CPUs
simultaneously. In MPP each node may be a single CPU or a
collection of SMP CPUs. An MPP collection of SMP nodes is
sometimes called an SMP cluster. Each node has its own copy
of the operating system, memory, and disk storage, and there
is a data or process exchange mechanism so that each
computer can work on a different part of a problem. Software
must be written specifically to take advantage of this
architecture.
- neural network
- A complex nonlinear modeling technique based on a model of
a human neuron. A neural net is used to predict outputs
(dependent variables) from a set of inputs (independent
variables) by taking linear combinations of the inputs and
then making nonlinear transformations of the linear
combinations using an activation function. It can be shown
theoretically that such combinations and transformations can
approximate virtually any type of response function. Thus,
neural nets use large numbers of parameters to approximate
any model. Neural nets are often applied to predict future
outcomes based on prior experience. For example, a neural net
application could be used to predict who will respond to a
direct mailing.
- node
- A decision point in a classification (i.e., decision)
tree. Also, a point in a neural net that combines input from
other nodes and produces an output through application of an
activation function.
- noise
- The difference between the observed data and a model's predictions.
Sometimes data is referred to as noisy when it contains
errors such as many missing or incorrect values or when
there are extraneous columns.
- non-applicable data
- Missing values that would be logically impossible (e.g.,
pregnant males) or are obviously not relevant.
- normalize
- A collection of numeric data is normalized by subtracting
the minimum value from all values and dividing by the range
of the data. This yields data with a similarly shaped
histogram but with all values between 0 and 1. It is useful
to do this for all inputs into neural nets and also for
inputs into other regression models. (Also see standardize.)
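The calculation described above, as a short sketch. It
assumes the values are not all identical, since the range
would then be zero.

```python
def normalize(values):
    """Subtract the minimum and divide by the range, giving values in 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([2, 4, 6, 10]))  # -> [0.0, 0.25, 0.5, 1.0]
```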
- OLAP
- On-Line Analytical Processing tools give the user the
capability to perform multi-dimensional analysis of the
data.
- optimization criterion
- A positive function of the difference between predictions
and data. Estimates are chosen so as to optimize the
function or criterion. Least squares and maximum likelihood
are examples.
- outliers
- Technically, outliers are data items that did not (or are
thought not to have) come from the assumed population of
data -- for example, a non-numeric value when you are expecting
only numeric values. A more casual usage refers to data
items that fall outside the boundaries that enclose most
other data items in the data set.
- overfitting
- A tendency of some modeling techniques to assign
importance to random variations in the data by declaring
them important patterns.
- overlay
- Data not collected by the organization, such as data from
a proprietary database, that is combined with the
organization's own data.
- parallel processing
- Several computers or CPUs linked together so that each can
be computing simultaneously.
- pattern
- Analysts and statisticians spend much of their time
looking for patterns in data. A pattern can be a
relationship between two variables. Data mining techniques
include automatic pattern discovery that makes it possible
to detect complicated non-linear relationships in data.
Patterns are not the same as causality.
- precision
- The precision of an estimate of a parameter in a model is
a measure of how variable the estimate would be over other
similar data sets. A very precise estimate would be one that
did not vary much over different data sets. Precision does
not measure accuracy. Accuracy is a measure of how close the
estimate is to the real value of the parameter. Accuracy is
measured by the average distance over different data sets of
the estimate from the real value. Estimates can be accurate
but not precise, or precise but not accurate. A precise but
inaccurate estimate is usually biased, with the bias equal
to the average distance from the real value of the
parameter.
- predictability
- Some data mining vendors use predictability of
associations or sequences to mean the same as confidence.
- prevalence
- The measure of how often the collection of items in an
association occur together as a percentage of all the
transactions. For example, "In 2% of the purchases at
the hardware store, both a pick and a shovel were
bought."
- pruning
- Eliminating lower level splits or entire sub-trees in a
decision tree. This term is also used to describe algorithms
that adjust the topology of a neural net by removing (i.e.,
pruning) hidden nodes.
- range
- The range of the data is the difference between the
maximum value and the minimum value. Alternatively, range
can include the minimum and maximum, as in "The value
ranges from 2 to 8."
- RDBMS
- Relational Database Management System.
- regression tree
- A decision tree that predicts values of continuous
variables.
- resubstitution error
- The estimate of error based on the differences between the
predicted values of a trained model and the observed values
in the training set.
- right-hand side
- When an association between two variables is defined, the
second item is called the right-hand side (or consequent).
For example, in the relationship "When a prospector
buys a pick, he buys a shovel 14% of the time,"
"buys a shovel" is the right-hand side.
- r-squared
- A number between 0 and 1 that measures how well a model
fits its training data. One is a perfect fit; however, zero
implies the model has no predictive ability. It is computed
as the covariance between the predicted and observed values
divided by the standard deviations of the predicted and
observed values.
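A sketch of the calculation, on the assumption that the
correlation-based definition is intended: the covariance of
predicted and observed values is divided by the product of
their standard deviations, and the ratio is squared. The
function name and data are illustrative.

```python
import statistics

def r_squared(observed, predicted):
    """Squared correlation between observed and predicted values."""
    mo = statistics.fmean(observed)
    mp = statistics.fmean(predicted)
    # Sums of products; the common 1/n factors cancel in the ratio.
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))    # predictions track the data
print(r_squared([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relationship
```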
- sampling
- Creating a subset of data from the whole. Random sampling
attempts to represent the whole by choosing the sample
through a random mechanism.
- sensitivity analysis
- Varying the parameters of a model to assess the change in
its output.
- sequence discovery
- The same as association, except that the time sequence of
events is also considered. For example, "Twenty percent
of the people who buy a VCR buy a camcorder within four
months."
- significance
- A probability measure of how strongly the data support a
certain result (usually of a statistical test). If the
significance of a result is said to be .05, it means that
there is only a .05 probability that the result could have
happened by chance alone. Very low significance (less than
.05) is usually taken as evidence that the data mining model
should be accepted since events with very low probability
seldom occur. So if the estimate of a parameter in a model
showed a significance of .01 that would be evidence that the
parameter must be in the model.
- SMP
- Symmetric multi-processing is a computer configuration
where many CPUs share a common operating system, main memory
and disks. They can work on different parts of a problem at
the same time.
- standardize
- A collection of numeric data is standardized by
subtracting a measure of central location (such as the mean
or median) and by dividing by some measure of spread (such
as the standard deviation, interquartile range or range).
This yields data with a similarly shaped histogram with
values centered around 0. It is sometimes useful to do this
with inputs into neural nets and also inputs into other
regression models. (Also see normalize.)
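One common choice, sketched below, subtracts the mean and
divides by the standard deviation. Using the population
standard deviation is an assumption; the median and
interquartile range would work the same way.

```python
import statistics

def standardize(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # invented data: mean 5, standard deviation 2
print(standardize(data))
```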
- supervised learning
- The collection of techniques where analysis uses a
well-defined (known) dependent variable. All regression and
classification techniques are supervised.
- support
- The measure of how often the collection of items in an
association occur together as a percentage of all the
transactions. For example, "In 2% of the purchases at
the hardware store, both a pick and a shovel were
bought."
- test data
- A data set independent of the training data set, used to
fine-tune the estimates of the model parameters (i.e.,
weights).
- test error
- The estimate of error based on the difference between the
predictions of a model on a test data set and the observed
values in the test data set when the test data set was not
used to train the model.
- time series
- A series of measurements taken at consecutive points in
time. Data mining products which handle time series
incorporate time-related operators such as moving average.
(Also see windowing.)
- time series model
- A model that forecasts future values of a time series
based on past values. The model form and training of the
model usually take into consideration the correlation
between values as a function of their separation in time.
- topology
- For a neural net, topology refers to the number of layers
and the number of nodes in each layer.
- training
- Another term for estimating a model's parameters based on
the data set at hand.
- training data
- A data set used to estimate or train a model.
- transformation
- A re-expression of the data such as aggregating it,
normalizing it, changing its unit of measure, or taking the
logarithm of each data item.
- unsupervised learning
- This term refers to the collection of techniques where
groupings of the data are defined without the use of a
dependent variable. Cluster analysis is an example.
- validation
- The process of testing the models with a data set
different from the training data set.
- variance
- The most commonly used statistical measure of dispersion.
The first step is to square the deviation of each data item
from the average value. Then the average of the squared
deviations is calculated to obtain an overall measure of
variability.
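The two steps can be sketched as follows. Dividing by the
number of items gives the population variance; sample
variance, which divides by one less, is a common
alternative not shown here.

```python
def variance(values):
    """Average of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return sum(squared_deviations) / len(values)

print(variance([2, 4, 4, 4, 5, 5, 7, 9]))  # -> 4.0
```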
- visualization
- Visualization tools graphically display data to facilitate
better understanding of its meaning. Graphical capabilities
range from simple scatter plots to complex multi-dimensional
representations.
- windowing
- Used when training a model with time series data. A window
is the period of time used for each training case. For
example, if we have weekly stock price data that covers
fifty weeks, and we set the window to five weeks, then the
first training case uses weeks one through five and compares
its prediction to week six. The second case uses weeks two
through six to predict week seven, and so on.
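The fifty-week example can be sketched as below. The price
series is a stand-in in which each value equals its week
number, so the windows are easy to read; the function name
is hypothetical.

```python
def windowed_cases(series, window):
    """Return (inputs, target) pairs: `window` values, then the next value."""
    return [
        (series[i:i + window], series[i + window])
        for i in range(len(series) - window)
    ]

prices = list(range(1, 51))  # stand-in data: value = week number
cases = windowed_cases(prices, 5)
print(cases[0])    # weeks one through five, target week six
print(cases[1])    # weeks two through six, target week seven
print(len(cases))  # -> 45
```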