Overview of Data Mining
Tasks of Data Mining
- Predictive tasks: use some variables to predict unknown or future values of other variables
  - Classification
  - Regression
  - Deviation detection
- Descriptive tasks: find human-interpretable patterns that describe the underlying relationships in data
  - Clustering
  - Association rule discovery
  - Sequential pattern discovery
Classification
Find a model for the class attribute as a function of the values of the other attributes.
Used for discrete target variables.
Applications: credit card transactions (legitimate | fraudulent), tumor cells (benign | malignant), image object recognition
Regression
Used for continuous target variables.
Applications: predicting wind speed, predicting stock trends
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications: credit card fraud detection, network attack detection
Clustering
Applications: document clustering, image clustering
Association Rule Discovery
Applications: targeted marketing
Sequential Pattern Discovery
Data
Data set: a data set can be viewed as a collection of data objects. An object is also known as a record, point, or instance.
Data object: data objects are described by a number of attributes. An attribute is a property or characteristic of an object; it is also known as a variable, field, characteristic, or feature.
Attribute Types
Taxonomy 1:
- Qualitative attributes: distinguish objects but lack most of the properties of numbers
  - Nominal: e.g., student IDs, eye color, zip codes
  - Ordinal: e.g., rankings (excellent, average, poor)
- Quantitative attributes: possess most of the properties of numbers
  - Interval: e.g., calendar dates
  - Ratio: e.g., age
These four types correspond to four cumulative categories of operations (nominal supports only the first, ordinal the first two, interval the first three, ratio all four):
- Distinctness: $=$, $\neq$
- Order: $<$, $>$
- Addition: $+$, $-$
- Multiplication: $*$, $/$
Taxonomy 2:
- Discrete attributes: e.g., zip codes
- Continuous attributes: e.g., temperature, height, weight
Types of Data Sets
Three broad categories:
- Record
  - Data matrix
  - Document data
  - Transaction data
- Graph
  - World Wide Web data
  - Molecular structures
- Ordered
  - Temporal data
  - Genetic sequence data
Data Preprocessing
Data preprocessing is the preparatory work carried out before data mining.
Data is not perfect:
- Noise
- Missing values
- Inconsistent data
- Duplicate data
Data preprocessing techniques:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Attribute transformation
Techniques 1 and 2 mainly reduce the number of data objects; 3 and 4 mainly reduce the number of attributes; 5 and 6 create new attributes or change existing ones.
Aggregation
- Advantages: data reduction, more stable data
- Disadvantage: loss of interesting details
Sampling
- Principle: the sample must be representative of the full data set
- The sample size must also be chosen carefully
1) Simple random sampling
  - Sampling without replacement
  - Sampling with replacement
2) Stratified sampling: draw from each stratum (e.g., each class) separately; see the sketch below
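A minimal NumPy sketch of these sampling schemes (the data array, class labels, and sample size of 100 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)             # toy data set: 1000 record indices
labels = rng.integers(0, 3, 1000)  # toy class labels used for stratification

# 1) Simple random sampling
without_repl = rng.choice(data, size=100, replace=False)  # each record picked at most once
with_repl = rng.choice(data, size=100, replace=True)      # records may repeat

# 2) Stratified sampling: sample each class in proportion to its size
stratified = np.concatenate([
    rng.choice(data[labels == c],
               size=int(100 * np.mean(labels == c)),
               replace=False)
    for c in np.unique(labels)
])
```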
Dimensionality Reduction
Goals:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow the data to be more easily visualized
Method 1) Principal Component Analysis (PCA)
PCA is a linear algebra technique that projects the data from its original high-dimensional space into a lower-dimensional space.
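A hedged scikit-learn example of such a projection (the synthetic data and the choice of 2 components are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)   # toy data: 100 records with 10 attributes
pca = PCA(n_components=2)     # keep the top-2 principal components
X_low = pca.fit_transform(X)  # shape (100, 2): lower-dimensional representation
print(pca.explained_variance_ratio_)  # variance captured by each component
```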
Feature Subset Selection
Use case: the data contains redundant features (which duplicate information carried by other features) and irrelevant features (which carry no information useful for the task).
Feature Creation
Three common approaches:
1) Feature extraction
  - Example: image pixels → edge features
2) Feature construction
  - Example: volume and mass → density
3) Mapping data to a new space
  - Example: the Fourier transform
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
1) Simple functions
- Examples: $x^k$, $\log(x)$, $e^x$, $|x|$
- Variable transformations should be applied with caution, since they change the nature of the data.
2) Standardization / Normalization
- Used when variables differ greatly in magnitude and range of values
- Min-max normalization: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
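A minimal NumPy sketch of this min-max normalization, applied column by column (the sample matrix is assumed; a constant column would divide by zero and need special handling):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[1.5, 45.0],
              [1.8, 150.0],
              [1.6, 70.0]])   # e.g., height (m) and weight (kg)
print(min_max_normalize(X))   # both columns now lie in [0, 1]
```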
Classification
1) Build a classification model from the training data
2) Apply the model to test data
Classification therefore involves two processes: induction (learning the model) and deduction (applying it).
Decision Tree
Algorithms for building decision trees include:
- Hunt's Algorithm (the simplest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
A decision tree has three types of nodes:
- Root node: no incoming links and zero or more outgoing links
- Internal nodes: exactly one incoming link and two or more outgoing links
- Leaf (terminal) nodes: exactly one incoming link and no outgoing links
Impurity
Three ways to measure the impurity of a single node:
1) Gini index: $GINI(t) = 1 - \sum_j [p(j|t)]^2$
2) Entropy: $Entropy(t) = -\sum_j p(j|t)\log_2 p(j|t)$
3) Misclassification error: $Error(t) = 1 - \max_i p(i|t)$
Note: $p(j|t)$ denotes the fraction of records belonging to class $j$ at node $t$.
Gini index example:
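A small Python check of all three measures, using an assumed node holding 1 record of class C1 and 5 records of class C2:

```python
import numpy as np

def impurity(counts):
    """Gini, entropy, and misclassification error from class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                      # p(j|t): class probabilities at node t
    gini = 1.0 - np.sum(p ** 2)
    nz = p[p > 0]                        # drop zeros, since 0 * log2(0) := 0
    entropy = -np.sum(nz * np.log2(nz))
    error = 1.0 - p.max()
    return gini, entropy, error

# Node with counts C1 = 1, C2 = 5:
# Gini = 1 - (1/6)^2 - (5/6)^2 = 10/36 ≈ 0.278, error = 1 - 5/6 ≈ 0.167
print(impurity([1, 5]))
```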
Hunt’s Algorithm
Hunt's algorithm proceeds recursively:
Step 1) If all records in $D_t$ belong to the same class $y_t$, then $t$ is a leaf node labeled $y_t$.
Step 2) If $D_t$ contains records from more than one class, use an attribute to split the data into smaller subsets, then apply the procedure recursively to each subset.
Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion.
It grows the tree by making a series of locally optimum decisions, without considering the overall optimum.
As a result, more than one tree can fit the same data: the decision tree is not unique. A runnable sketch follows.
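A compact sketch of this recursion for categorical attributes (the dict-based record layout and helper names are illustrative assumptions; the Gini-weighted split choice used here is defined formally in the next subsection):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels):
    """Pick the attribute whose multi-way split minimizes the weighted
    Gini of the children (a locally optimal, greedy choice)."""
    n = len(labels)
    best = None
    for attr in records[0]:
        parts = {}
        for rec, lab in zip(records, labels):
            recs, labs = parts.setdefault(rec[attr], ([], []))
            recs.append(rec)
            labs.append(lab)
        m = sum(len(labs) / n * gini(labs) for _, labs in parts.values())
        if best is None or m < best[0]:
            best = (m, attr, parts)
    return best[1], best[2]

def hunt(records, labels, min_size=1):
    """Grow a decision tree by Hunt's two recursive steps."""
    # Step 1: if all records in D_t share one class y_t, t becomes a leaf labeled y_t
    if len(set(labels)) == 1 or len(records) <= min_size:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Step 2: split D_t into smaller subsets and recurse on each subset
    attr, parts = best_split(records, labels)
    if len(parts) == 1:  # no attribute separates the records: majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"split_on": attr,
            "children": {value: hunt(recs, labs, min_size)
                         for value, (recs, labs) in parts.items()}}

# Toy usage: records are dicts of categorical attributes
records = [{"refund": "yes", "status": "single"},
           {"refund": "no",  "status": "married"},
           {"refund": "no",  "status": "single"}]
print(hunt(records, ["no", "no", "yes"]))
```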
Implementing Hunt's algorithm raises three key issues:
- How to determine the best split
- How many branches to generate
- When to stop splitting
Determining the Best Split
- Compute the impurity $P$ of the node before splitting
- Compute the impurity $M$ after splitting
  - Compute the impurity of each child node, then take the weighted average to obtain $M$
  - Gini index: $GINI_{split} = \sum^k_{i=1} \frac{n_i}{n} GINI(i)$
  - Entropy: $Entropy_{split} = \sum^k_{i=1} \frac{n_i}{n} Entropy(i)$
  - Misclassification error: $Error_{split} = \sum^k_{i=1} \frac{n_i}{n} Error(i)$
  - where $n_i$ is the number of records at child $i$ and $n$ is the number of records at the parent node
- Choose the attribute split that produces the highest gain, where gain $= P - M$
  - Since $P$ is fixed, this amounts to choosing the split with the smallest $M$
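A small numeric check of the gain computation (the parent and child class counts are assumed for illustration):

```python
def gini_counts(counts):
    """Gini impurity from a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [7, 5]                  # class counts before the split
children = [[6, 1], [1, 4]]      # class counts of each child after the split
n = sum(parent)

P = gini_counts(parent)                                    # impurity before
M = sum(sum(ch) / n * gini_counts(ch) for ch in children)  # weighted impurity after
print(P, M, P - M)   # gain = P - M; pick the split that maximizes it
```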
Branching
The split depends on the attribute type: nominal, ordinal, or continuous.
It also depends on the number of branches: 2-way (binary) split or multi-way split.
1) Nominal attributes: a multi-way split uses one partition per distinct value; a binary split groups the values into two subsets.
2) Ordinal attributes: the split must not violate the ordering of the values.
3) Continuous attributes: choose an appropriate split point.
Choosing a split point for a continuous attribute takes four steps:
- Sort the records on the attribute values
- Generate the candidate split positions
- Linearly scan these values, each time updating the count matrix and computing the Gini index
- Choose the split position with the lowest Gini index
Example:
1) Sort the values
2) Generate the candidate split points
3) Compute the corresponding Gini values
4) Choose the best (or locally optimal) split; see the sketch below
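A runnable sketch of this scan for one continuous attribute (the income values and labels are toy data; for brevity the child impurities are recomputed at each cut instead of incrementally updating a count matrix):

```python
import numpy as np

def best_continuous_split(values, labels):
    """Sort the attribute, try midpoints between adjacent distinct values,
    and return the cut with the lowest weighted Gini."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(v)

    def gini(lab):
        _, counts = np.unique(lab, return_counts=True)
        return 1.0 - np.sum((counts / len(lab)) ** 2)

    best_cut, best_m = None, float("inf")
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue                              # no valid cut between equal values
        m = i / n * gini(y[:i]) + (n - i) / n * gini(y[i:])
        if m < best_m:
            best_cut, best_m = (v[i] + v[i - 1]) / 2, m
    return best_cut, best_m

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_continuous_split(income, cheat))  # cut between 95 and 100
```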
Determining When to Stop Splitting
Stop expanding a node when all of its records belong to the same class.
Early termination: stop when the number of records falls below a minimum threshold.
Model Overfitting
A good classification model must not only fit the training data well; it must also accurately classify records it has never seen before.
Overfitting: when the model is too complex, training error continues to decrease while test error begins to increase.
Underfitting: when the model is too simple, both training and test errors are large.
Causes of overfitting:
1) Noise
2) Insufficient training examples
3) Excessive model complexity
Occam's Razor
Given two models with similar errors, one should prefer the simpler model over the more complex one.
Model complexity should therefore be included when evaluating a model.
Under overfitting, the training error no longer provides a good estimate of how the model will perform on test data, so other methods of estimating the error rate are needed.
Methods for estimating generalization errors
1) Pessimistic error estimate
- $e'(T) = e(T) + N \times 0.5 / N_t$
- $N$: number of leaf nodes
- $N_t$: number of training records
- $e(T)$: training error rate
- $e'(T)$: estimated generalization error
- Example: for a tree with 30 leaf nodes and 10 errors on training data (out of 1000 instances): training error $= 10/1000 = 1\%$; generalization error $= 1\% + 30 \times 0.5/1000 = 2.5\%$
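The same arithmetic as a one-line function (the function and parameter names are mine, not from the source):

```python
def pessimistic_error(train_errors, n_train, n_leaves, penalty=0.5):
    """e'(T) = e(T) + N * 0.5 / N_t, with e(T) taken as the training error rate."""
    return train_errors / n_train + n_leaves * penalty / n_train

print(pessimistic_error(train_errors=10, n_train=1000, n_leaves=30))  # 0.025
```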
2) Minimum Description Length (MDL)
- $Cost(Model, Data) = Cost(Model) + Cost(Data|Model)$
- $Cost(Model)$: the cost of encoding the model (attribute encoding + leaf encoding)
- $Cost(Data|Model)$: the cost of encoding the misclassified records
Model Evaluation
Common evaluation metrics:
1) Confusion matrix
Accuracy: $Accuracy = \frac{a+d}{a+b+c+d} = \frac{TP+TN}{TP+TN+FP+FN}$
where, in the 2×2 confusion matrix, $a = TP$ (true positives), $b = FN$ (false negatives), $c = FP$ (false positives), and $d = TN$ (true negatives).
Limitation: accuracy can be misleading on imbalanced data; a model that always predicts the majority class of a 99%/1% data set achieves 99% accuracy while detecting nothing.
2) Precision, Recall, and F-measure
Precision (exactness): the percentage of tuples the classifier labeled as positive that are actually positive: $Precision = \frac{TP}{TP+FP}$
Recall (completeness): the percentage of positive tuples the classifier labeled as positive: $Recall = \frac{TP}{TP+FN}$
F-measure (F1 or F-score): the harmonic mean of precision and recall: $F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
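A minimal sketch computing these metrics from raw confusion-matrix counts (the counts in the usage line are assumed):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                           # exactness
    recall = tp / (tp + fn)                              # completeness
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```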
Methods for Performance Evaluation
1) Holdout method
- The given data is randomly partitioned into two independent sets
- Training set (typically 2/3 of the data) for model construction
- Test set (typically 1/3 of the data) for accuracy estimation
2) Random subsampling: a variation of holdout
- Repeat holdout k times; accuracy = the average of the k accuracies obtained
3) Cross-validation (k-fold, where k = 10 is most popular)
- Partition the data into k disjoint subsets
- Repeat k times: train on k-1 partitions, test on the remaining one
- The total error is obtained by summing the errors over the k runs and averaging
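A hedged scikit-learn sketch of 10-fold cross-validation (the synthetic data and the decision-tree classifier are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold CV: train on 9 folds, test on the held-out fold, repeated 10 times
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())  # average accuracy over the 10 runs
```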
Nearest Neighbor Classifier
The idea is to find all training examples that are relatively similar to the attributes of the test instance.
Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute its distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote); see the sketch below
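A minimal NumPy implementation of this procedure with a majority vote (the training points, labels, and k = 3 are toy assumptions; the Euclidean distance used here is defined in the Voronoi subsection below):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest records
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> "A"
```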
Choosing the value of k:
- If k is too small, sensitive to noise points
- If k is too large, neighborhood may include points from other classes
Voronoi Diagram
- A Voronoi diagram is a partitioning of a plane into regions based on distance
- These regions are called Voronoi cells
Computing the distance between two points:
- Euclidean distance: $d(p,q) = \sqrt{\sum_i (p_i - q_i)^2}$
- where $p$ and $q$ are two points in the space and $p_i$, $q_i$ are their attribute values
Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example: height varies from 1.5 m to 1.8 m; weight varies from 45 kg to 150 kg; income varies from $10K to $1M
Naive Bayes Classifier
Bayesian Belief Networks