
Introduction to Data Mining

Overview of Data Mining

Tasks of Data Mining

  • Predictive tasks: use some variables to predict unknown or future values of other variables
    • Classification
    • Regression
    • Deviation detection
  • Descriptive tasks: find human-interpretable patterns that describe the underlying relationships in the data
    • Clustering
    • Association rule discovery
    • Sequential pattern discovery

Classification

Find a model for the class attribute as a function of the values of the other attributes.

Used for discrete target variables.

Applications: credit-card transactions (legitimate vs. fraudulent), tumor cells (benign vs. malignant), image object recognition.

Regression

Used for continuous target variables.

Applications: predicting wind speed, predicting stock trends.

Deviation/Anomaly Detection

Detect significant deviations from normal behavior.

Applications: credit-card fraud detection, detecting network attacks.

Clustering

Applications: document clustering, image clustering.

Association Rule Discovery

Applications: targeted marketing.

Sequential Pattern Discovery

Data

Data set: a data set can be viewed as a collection of data objects. An object is also known as a record, point, or instance.

Data object: data objects are described by a number of attributes. An attribute is a property or characteristic of an object, and is also known as a variable, field, characteristic, or feature.

Attribute Types

Taxonomy 1:

  • Qualitative attributes: distinguish objects, but lack most properties of numbers
    • Nominal: e.g., student IDs, eye color, zip codes
    • Ordinal: e.g., rankings (excellent, average, poor)
  • Quantitative attributes: have most of the properties of numbers
    • Interval: e.g., calendar dates
    • Ratio: e.g., age

These correspond to four kinds of operations:

  1. Distinctness: $=$ $\neq$
  2. Order: $<$ $>$
  3. Addition: $+$ $-$
  4. Multiplication: $*$ $/$

Nominal attributes support only distinctness; ordinal attributes add order; interval attributes add addition and subtraction; ratio attributes support all four.

Taxonomy 2:

Discrete attributes: e.g., zip codes

Continuous attributes: e.g., temperature, height, weight

Types of Data Sets

Three broad categories:

  • Record
    • Data matrix
    • Document data
    • Transaction data
  • Graph
    • World Wide Web
    • Molecular structures
  • Ordered
    • Temporal data
    • Genetic sequence data

Data Preprocessing

Data preprocessing is the preparation work done before data mining.

Data is not perfect:

  • Noise
  • Missing values
  • Inconsistent data
  • Duplicate data

Data preprocessing methods:

  1. Aggregation
  2. Sampling
  3. Dimensionality reduction
  4. Feature subset selection
  5. Feature creation
  6. Attribute transformation

Methods 1 and 2 mainly reduce the number of data objects; 3 and 4 mainly reduce the number of attributes; 5 and 6 mainly create or change attributes.

Aggregation

  • Advantages: data reduction; more stable data
  • Disadvantage: loss of interesting details

Sampling

  • Principle: the sample obtained must be representative of the full data set
  • The sample size must also be considered

1) Simple random sampling

  • Sampling without replacement
  • Sampling with replacement

2) Stratified sampling: partition the data into groups (strata), then draw random samples from each group (a minimal sketch of these schemes follows below)
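A minimal sketch of these sampling schemes in Python with numpy; the toy `data` array and group `labels` are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(100)                          # toy data set of 100 records
labels = np.repeat(["A", "B"], [80, 20])       # imbalanced group labels

# 1) Simple random sampling, without and with replacement
without = rng.choice(data, size=10, replace=False)
with_repl = rng.choice(data, size=10, replace=True)

# 2) Stratified sampling: draw 10% from each group, so small groups stay represented
strata = [rng.choice(data[labels == g], size=int(0.1 * (labels == g).sum()), replace=False)
          for g in np.unique(labels)]
sample = np.concatenate(strata)
```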

Dimensionality Reduction

Goals:

  • Avoid the curse of dimensionality
  • Reduce the amount of time and memory required by data mining algorithms
  • Allow the data to be more easily visualized

Method 1) Principal Component Analysis (PCA)

PCA is a linear-algebra technique that projects the data from its original high-dimensional space into a lower-dimensional space.

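A minimal PCA sketch in numpy, matching the description above (center the data, find the principal directions, project); the toy matrix `X` is assumed:

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)              # center each attribute
    # Rows of Vt are the principal directions (unit eigenvectors of the covariance matrix)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T      # coordinates in the lower-dimensional space

X = np.random.default_rng(0).normal(size=(50, 5))   # toy 5-dimensional data
X_2d = pca(X, n_components=2)                       # now 50 x 2
```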

Feature Subset Selection

Use case: the data contains redundant features (which duplicate information carried by other attributes) or irrelevant features (which contain no information useful for the task).

Feature Creation

Three common approaches:

1) Feature extraction

  • Example: image —> edge features

2) Feature construction

  • Example: volume and mass —> density

3) Mapping data to a new space

  • Example: Fourier transform

Attribute Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.

1) Simple function transformations

  • Examples: $x^k$, $\log(x)$, $e^x$, $|x|$
  • Variable transformations should be applied with caution, since they change the nature of the data.

2) Standardization / Normalization

  • Used when variables differ greatly in magnitude and range
  • Min-max normalization: $X'=(x-MinValue)/(MaxValue-MinValue)$
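A short numpy sketch of the min-max formula above, applied column-wise so every attribute is rescaled to [0, 1]; the sample values are assumed:

```python
import numpy as np

def min_max_normalize(X):
    """Apply X' = (x - MinValue) / (MaxValue - MinValue) to each column."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

# e.g., height (m) and weight (kg) end up on the same [0, 1] scale
X = np.array([[1.5, 45.0],
              [1.8, 150.0],
              [1.6, 60.0]])
print(min_max_normalize(X))
```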

Classification

1) Build a classification model from the training data

2) Apply the model to the test data

Classification mainly involves two processes: induction (learning the model) and deduction (applying the model).


Decision Tree

Algorithms for building decision trees include:

  • Hunt's Algorithm (the simplest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT

A decision tree has three types of nodes:

  • A root node —> no incoming links and zero or more outgoing links
  • Internal nodes —> exactly one incoming link and two or more outgoing links
  • Leaf or terminal nodes —> exactly one incoming link and no outgoing links

Impurity

Three ways to measure the impurity of a single node:

1) Gini index: $GINI(t) = 1 - \sum_j[p(j|t)]^2$

2) Entropy: $Entropy(t) = - \sum_j p(j|t)\log_2p(j|t)$

3) Misclassification error: $Error(t) = 1 - \max_j p(j|t)$

Note: $p(j|t)$ denotes the fraction of records at node $t$ that belong to class $j$.

Example:
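With assumed class counts, say 4 records of class $C_1$ and 2 of class $C_2$ at node $t$, the three measures evaluate to:

$$GINI(t) = 1 - \left(\tfrac{4}{6}\right)^2 - \left(\tfrac{2}{6}\right)^2 \approx 0.444$$

$$Entropy(t) = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} \approx 0.918$$

$$Error(t) = 1 - \max\left(\tfrac{4}{6}, \tfrac{2}{6}\right) \approx 0.333$$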

Hunt's Algorithm

Hunt's Algorithm proceeds recursively as follows:

Step 1) If all records in data set $D_t$ belong to the same class $y_t$, then $t$ is a leaf node labeled $y_t$.

Step 2) If $D_t$ contains records from more than one class, use an attribute to split the data into smaller subsets, and apply the procedure recursively to each subset.


Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion.

It grows a decision tree by making a series of locally optimal decisions, without considering the global optimum.

There can be more than one tree that fits the same data; the decision tree is not unique.
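A compact runnable sketch of the two steps, assuming records are dicts of nominal attributes and using weighted Gini to pick the locally best attribute (all names here are illustrative, not from the original notes):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, labels, attrs):
    """Grow a decision tree greedily. Returns a class label (leaf) or a
    nested dict {(attribute, value): subtree}."""
    if len(set(labels)) == 1:       # step 1: pure node -> leaf labeled y_t
        return labels[0]
    if not attrs:                   # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    def weighted_gini(a):           # impurity M after splitting on attribute a
        groups = {}
        for r, y in zip(records, labels):
            groups.setdefault(r[a], []).append(y)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(attrs, key=weighted_gini)   # step 2: locally optimal split
    tree = {}
    for v in set(r[best] for r in records):
        sub = [(r, y) for r, y in zip(records, labels) if r[best] == v]
        sub_r, sub_y = [r for r, _ in sub], [y for _, y in sub]
        tree[(best, v)] = hunt(sub_r, sub_y, [a for a in attrs if a != best])
    return tree
```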

Three key issues in implementing Hunt's algorithm:

  1. Determining the best split
  2. Deciding how many branches to generate
  3. Determining when to stop splitting

Determining the best split

  • Compute the impurity P of the node before splitting
  • Compute the impurity M after splitting
    • Compute the impurity of each child node, then take the weighted average (weighted impurity) to obtain M
    • Gini index: $\Large GINI_{split}=\sum^k_{i=1}\frac{n_i}{n}GINI(i)$
    • Entropy: $\Large Entropy_{split}=\sum^k_{i=1}\frac{n_i}{n}Entropy(i)$
    • Misclassification error: $\Large Error_{split}=\sum^k_{i=1}\frac{n_i}{n}Error(i)$
    • $n_i$ is the number of records in child node $i$, and $n$ is the number of records at the parent node
  • Choose the attribute split that yields the highest gain, where $Gain = P - M$ (see the numeric sketch after this list)
    • i.e., choose the split with the smallest M

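A tiny numeric sketch of the gain computation described above; the class counts for the parent and the two children are made up for illustration:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

parent = [7, 5]                  # class counts before the split
children = [[6, 1], [1, 4]]      # class counts in the two child nodes

P = gini(parent)                                      # impurity before splitting
n = sum(parent)
M = sum(sum(ch) / n * gini(ch) for ch in children)    # weighted impurity after
print(f"P = {P:.3f}, M = {M:.3f}, gain = P - M = {P - M:.3f}")
```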

Branching

Depends on the attribute type: nominal, ordinal, or continuous.

Depends on the number of ways to split: 2-way split or multi-way split.

1) Nominal attributes: a multi-way split uses one branch per distinct value; a binary split groups the values into two subsets.


2) Ordinal attributes: groupings must not violate the order of the values (e.g., {Small, Medium} vs. {Large} is valid, but {Small, Large} vs. {Medium} is not).


3) Continuous attributes: choose an appropriate split point (threshold) for the partition.


How to choose the split point, in four steps (see the sketch after this list):

  1. Sort the attribute values
  2. Identify candidate split positions
  3. Linearly scan these positions, each time updating the count matrix and computing the Gini index
  4. Choose the split position with the lowest Gini index
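A sketch of the scan, using midpoints between adjacent sorted values as candidate positions; the attribute values and labels are assumed for illustration:

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]            # step 1: sorted
labels = ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]

best_split = None
for i in range(1, len(values)):
    threshold = (values[i - 1] + values[i]) / 2                   # step 2: candidate
    left, right = labels[:i], labels[i:]
    m = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)  # step 3
    if best_split is None or m < best_split[1]:
        best_split = (threshold, m)                               # step 4: lowest Gini
print("best threshold and Gini:", best_split)
```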


Stopping criteria

Stop expanding a node when all of its records belong to the same class.

Early termination is also possible (e.g., stop when the number of records falls below a minimum threshold).

Model Overfitting

A good classification model must not only fit the training data well; it must also accurately classify records it has never seen before.

Overfitting: when the model is too complex, the training error continues to decrease while the test error begins to increase.

Underfitting: when the model is too simple, both the training and test errors are large.

Causes of overfitting:

1) Noise

2) Insufficient training examples

3) Excessive model complexity

Occam's Razor

Given two models with similar errors, one should prefer the simpler model over the more complex one.

One should therefore include model complexity when evaluating a model.

Under overfitting, the training error no longer provides a good estimate of how the model will perform on test data, so new methods are needed to estimate the error rate.

Methods for estimating generalization errors

1) Pessimistic error estimate

  • $e'(T) = e(T) + N \times 0.5 / N_t$
    • $N$ —> number of leaf nodes
    • $N_t$ —> number of training instances
    • $e(T)$ —> training error
    • $e'(T)$ —> generalization error
  • e.g., for a tree with 30 leaf nodes and 10 errors on training data (out of 1000 instances): training error $= 10/1000 = 1\%$; generalization error $= 1\% + 30 \times 0.5/1000 = 2.5\%$

2) Minimum Description Length (MDL)

  • $Cost(Model, Data)=Cost(Model)+Cost(Data|Model)$
    • $Cost(Model)$ —> cost of encoding the model = attribute encoding + leaf encoding
    • $Cost(Data|Model)$ —> cost of encoding the misclassified records
  • The model with the smallest total cost is preferred

Model Evaluation

Common metrics for model evaluation:

1) Confusion matrix

The confusion matrix tabulates predictions against actual classes:

  • $a$ = TP (actual positive, predicted positive)
  • $b$ = FN (actual positive, predicted negative)
  • $c$ = FP (actual negative, predicted positive)
  • $d$ = TN (actual negative, predicted negative)

Model accuracy: $\Large Accuracy=\frac{a+d}{a+b+c+d}=\frac{TP+TN}{TP+TN+FP+FN}$


Limitation: accuracy can be misleading when the classes are imbalanced; a model that always predicts the majority class can achieve very high accuracy while never detecting the minority class.

2) Precision and recall

  • Precision (exactness): what percentage of tuples that the classifier labeled as positive are actually positive. $Precision = \frac{TP}{TP+FP}$
  • Recall (completeness): what percentage of positive tuples the classifier labeled as positive. $Recall = \frac{TP}{TP+FN}$
  • F-measure (F1 or F-score): the harmonic mean of precision and recall. $F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
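A minimal sketch computing these metrics from confusion-matrix counts; the counts themselves are made up for illustration:

```python
TP, FN, FP, TN = 40, 10, 5, 45    # assumed confusion-matrix counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} F1={f1:.3f}")
```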

Methods for reliable performance evaluation:

1) Holdout method

  • The given data is randomly partitioned into two independent sets
    • Training set (typically 2/3 of the data set) for model construction
    • Test set (typically 1/3 of the data set) for accuracy estimation

2) Random subsampling: a variation of holdout

  • Repeat holdout k times; accuracy = the average of the accuracies obtained

3) Cross-validation (k-fold, where k = 10 is most popular; see the sketch below)

  • Partition the data into k disjoint subsets
  • Repeat k times: train on k-1 partitions and test on the remaining one
  • The total error is obtained by summing the errors over the k runs and averaging

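A minimal k-fold cross-validation sketch; the `fit_predict` callable stands in for any classifier (here a trivial majority-class baseline, purely for illustration):

```python
import numpy as np

def k_fold_cv(X, y, fit_predict, k=10, seed=0):
    """Partition the data into k disjoint folds; train on k-1 folds,
    test on the remaining one, and average the k accuracies."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = fit_predict(X[train], y[train], X[test])
        accs.append(np.mean(y_pred == y[test]))
    return float(np.mean(accs))

def majority_baseline(X_train, y_train, X_test):
    # always predict the most common training label
    vals, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), vals[np.argmax(counts)])

X = np.random.default_rng(1).normal(size=(100, 3))
y = np.random.default_rng(2).integers(0, 2, size=100)
print(k_fold_cv(X, y, majority_baseline, k=10))
```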

Nearest Neighbor Classifier

The idea: find all training examples that are relatively similar to the attributes of the test instance.

Requires three things:

  • The set of stored records
  • A distance metric to compute the distance between records
  • The value of k, the number of nearest neighbors to retrieve

To classify an unknown record (see the sketch after this list):

  • Compute its distance to the training records
  • Identify the k nearest neighbors
  • Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
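A compact sketch of this procedure (Euclidean distance, retrieve k nearest, majority vote); the toy training data are assumed:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> red
```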


Choosing the value of k:

  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes

Voronoi Diagram

  • A Voronoi diagram is a partitioning of a plane into regions based on distance
  • These regions are called Voronoi cells

Computing the distance between two points:

  • Euclidean distance: $d(p,q)=\sqrt{\sum_i(p_i-q_i)^2}$
    • $p$ and $q$ are two points in the space
    • $p_i$ and $q_i$ are their attribute values

Scaling issues (see the illustration below):

  • Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
  • Example: height varies from 1.5m to 1.8m; weight varies from 45kg to 150kg; income varies from $10K to $1M
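A quick numeric illustration of this issue, using the ranges from the example above (the two sample records are made up): before rescaling, the income attribute completely dominates the Euclidean distance.

```python
import numpy as np

p = np.array([1.6, 60.0, 20_000.0])     # height (m), weight (kg), income ($)
q = np.array([1.8, 55.0, 21_000.0])
print(np.linalg.norm(p - q))            # ~1000, driven almost entirely by income

# after min-max scaling with the attribute ranges above, every attribute contributes
lo = np.array([1.5, 45.0, 10_000.0])
hi = np.array([1.8, 150.0, 1_000_000.0])
print(np.linalg.norm((p - lo) / (hi - lo) - (q - lo) / (hi - lo)))
```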

Naive Bayes Classifier

Bayesian Belief Networks

Association Rules
