Overview of Data Mining
Tasks of Data Mining
- Predictive tasks: use some variables to predict unknown or future values of other variables
  - Classification
  - Regression
  - Deviation detection
- Descriptive tasks: find human-interpretable patterns that describe the underlying relationships in data
  - Clustering
  - Association rule discovery
  - Sequential pattern discovery
Classification
Find a model for the class attribute as a function of the values of the other attributes.
Used for discrete target variables.
Applications: credit card transactions (legitimate | fraudulent), tumor cells (benign | malignant), image object recognition
Regression
Used for continuous target variables.
Applications: predicting wind speed, predicting stock trends
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications: credit card fraud detection, network attack detection
Clustering
Applications: document clustering, image clustering
Association Rule Discovery
Applications: targeted marketing
Sequential Pattern Discovery
Data
Data set: a data set can be viewed as a collection of data objects. An object is also known as a record, point, or instance.
Data object: data objects are described by a number of attributes. An attribute is a property or characteristic of an object; it is also known as a variable, field, characteristic, or feature.
Attribute Types
Taxonomy 1:
- Qualitative attributes: distinguish objects but lack most of the properties of numbers
  - Nominal: e.g., student IDs, eye color, zip codes
  - Ordinal: e.g., rankings (excellent, average, poor)
- Quantitative attributes: possess most of the properties of numbers
  - Interval: e.g., calendar dates
  - Ratio: e.g., age
These four types correspond to four cumulative categories of operations (nominal supports only the first, ordinal the first two, interval the first three, ratio all four):
- Distinctness: $=$, $\neq$
- Order: $<$, $>$
- Addition: $+$, $-$
- Multiplication: $*$, $/$
Taxonomy 2:
- Discrete attributes: e.g., zip codes
- Continuous attributes: e.g., temperature, height, weight
Types of Data Sets
Three broad categories:
- Record
  - Data matrix
  - Document data
  - Transaction data
- Graph
  - World Wide Web data
  - Molecular structures
- Ordered
  - Temporal data
  - Genetic sequence data
Data Preprocessing
Data preprocessing is the preparatory work carried out before data mining.
Data is not perfect:
- Noise
- Missing values
- Inconsistent data
- Duplicate data
Data preprocessing techniques:
1. Aggregation
2. Sampling
3. Dimensionality reduction
4. Feature subset selection
5. Feature creation
6. Attribute transformation
Techniques 1 and 2 mainly reduce the number of data objects; 3 and 4 mainly reduce the number of attributes; 5 and 6 create new attributes or change existing ones.
Aggregation
- Advantages: data reduction, more stable data
- Disadvantage: loss of interesting details
Sampling
- Principle: the sample must be representative of the full data set
- The sample size must also be chosen carefully
1) Simple random sampling
  - Sampling without replacement
  - Sampling with replacement
2) Stratified sampling: draw from each stratum (e.g., each class) separately; see the sketch below
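A minimal NumPy sketch of these sampling schemes (the data array, class labels, and sample size of 100 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(1000)             # toy data set: 1000 record indices
labels = rng.integers(0, 3, 1000)  # toy class labels used for stratification

# 1) Simple random sampling
without_repl = rng.choice(data, size=100, replace=False)  # each record picked at most once
with_repl = rng.choice(data, size=100, replace=True)      # records may repeat

# 2) Stratified sampling: sample each class in proportion to its size
stratified = np.concatenate([
    rng.choice(data[labels == c],
               size=int(100 * np.mean(labels == c)),
               replace=False)
    for c in np.unique(labels)
])
```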
Dimensionality Reduction
Goals:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow the data to be more easily visualized
Method 1) Principal Component Analysis (PCA)
PCA is a linear algebra technique that projects the data from its original high-dimensional space into a lower-dimensional space.
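A hedged scikit-learn example of such a projection (the synthetic data and the choice of 2 components are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)   # toy data: 100 records with 10 attributes
pca = PCA(n_components=2)     # keep the top-2 principal components
X_low = pca.fit_transform(X)  # shape (100, 2): lower-dimensional representation
print(pca.explained_variance_ratio_)  # variance captured by each component
```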
Feature Subset Selection
Use case: the data contains redundant features (which duplicate information carried by other features) and irrelevant features (which carry no information useful for the task).
Feature Creation
Three common approaches:
1) Feature extraction
  - Example: image pixels → edge features
2) Feature construction
  - Example: volume and mass → density
3) Mapping data to a new space
  - Example: the Fourier transform
Attribute Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
1) Simple functions
- Examples: $x^k$, $\log(x)$, $e^x$, $|x|$
- Variable transformations should be applied with caution, since they change the nature of the data.
2) Standardization / Normalization
- Used when variables differ greatly in magnitude and range of values
- Min-max normalization: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
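A minimal NumPy sketch of this min-max normalization, applied column by column (the sample matrix is assumed; a constant column would divide by zero and need special handling):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[1.5, 45.0],
              [1.8, 150.0],
              [1.6, 70.0]])   # e.g., height (m) and weight (kg)
print(min_max_normalize(X))   # both columns now lie in [0, 1]
```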
Classification
1) Build a classification model from the training data
2) Apply the model to test data
Classification therefore involves two processes: induction (learning the model) and deduction (applying it).
Decision Tree
Algorithms for building decision trees include:
- Hunt's Algorithm (the simplest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
A decision tree has three types of nodes:
- Root node: no incoming links and zero or more outgoing links
- Internal nodes: exactly one incoming link and two or more outgoing links
- Leaf (terminal) nodes: exactly one incoming link and no outgoing links
Impurity
Three ways to measure the impurity of a single node:
1) Gini index: $GINI(t) = 1 - \sum_j [p(j|t)]^2$
2) Entropy: $Entropy(t) = -\sum_j p(j|t)\log_2 p(j|t)$
3) Misclassification error: $Error(t) = 1 - \max_i p(i|t)$
Note: $p(j|t)$ denotes the fraction of records belonging to class $j$ at node $t$.
Gini index example:
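A small Python check of all three measures, using an assumed node holding 1 record of class C1 and 5 records of class C2:

```python
import numpy as np

def impurity(counts):
    """Gini, entropy, and misclassification error from class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                      # p(j|t): class probabilities at node t
    gini = 1.0 - np.sum(p ** 2)
    nz = p[p > 0]                        # drop zeros, since 0 * log2(0) := 0
    entropy = -np.sum(nz * np.log2(nz))
    error = 1.0 - p.max()
    return gini, entropy, error

# Node with counts C1 = 1, C2 = 5:
# Gini = 1 - (1/6)^2 - (5/6)^2 = 10/36 ≈ 0.278, error = 1 - 5/6 ≈ 0.167
print(impurity([1, 5]))
```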
Hunt’s Algorithm
Hunt's algorithm proceeds recursively:
Step 1) If all records in $D_t$ belong to the same class $y_t$, then $t$ is a leaf node labeled $y_t$.
Step 2) If $D_t$ contains records from more than one class, use an attribute to split the data into smaller subsets, then apply the procedure recursively to each subset.
Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion.
It grows the tree by making a series of locally optimum decisions, without considering the overall optimum.
As a result, more than one tree can fit the same data: the decision tree is not unique. A runnable sketch follows.
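A compact sketch of this recursion for categorical attributes (the dict-based record layout and helper names are illustrative assumptions; the Gini-weighted split choice used here is defined formally in the next subsection):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels):
    """Pick the attribute whose multi-way split minimizes the weighted
    Gini of the children (a locally optimal, greedy choice)."""
    n = len(labels)
    best = None
    for attr in records[0]:
        parts = {}
        for rec, lab in zip(records, labels):
            recs, labs = parts.setdefault(rec[attr], ([], []))
            recs.append(rec)
            labs.append(lab)
        m = sum(len(labs) / n * gini(labs) for _, labs in parts.values())
        if best is None or m < best[0]:
            best = (m, attr, parts)
    return best[1], best[2]

def hunt(records, labels, min_size=1):
    """Grow a decision tree by Hunt's two recursive steps."""
    # Step 1: if all records in D_t share one class y_t, t becomes a leaf labeled y_t
    if len(set(labels)) == 1 or len(records) <= min_size:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Step 2: split D_t into smaller subsets and recurse on each subset
    attr, parts = best_split(records, labels)
    if len(parts) == 1:  # no attribute separates the records: majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"split_on": attr,
            "children": {value: hunt(recs, labs, min_size)
                         for value, (recs, labs) in parts.items()}}

# Toy usage: records are dicts of categorical attributes
records = [{"refund": "yes", "status": "single"},
           {"refund": "no",  "status": "married"},
           {"refund": "no",  "status": "single"}]
print(hunt(records, ["no", "no", "yes"]))
```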
Implementing Hunt's algorithm raises three key issues:
- How to determine the best split
- How many branches to generate
- When to stop splitting
Determining the Best Split
- Compute the impurity $P$ of the node before splitting
- Compute the impurity $M$ after splitting
  - Compute the impurity of each child node, then take the weighted average to obtain $M$
  - Gini index: $GINI_{split} = \sum^k_{i=1} \frac{n_i}{n} GINI(i)$
  - Entropy: $Entropy_{split} = \sum^k_{i=1} \frac{n_i}{n} Entropy(i)$
  - Misclassification error: $Error_{split} = \sum^k_{i=1} \frac{n_i}{n} Error(i)$
  - where $n_i$ is the number of records at child $i$ and $n$ is the number of records at the parent node
- Choose the attribute split that produces the highest gain, where gain $= P - M$
  - Since $P$ is fixed, this amounts to choosing the split with the smallest $M$
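A small numeric check of the gain computation (the parent and child class counts are assumed for illustration):

```python
def gini_counts(counts):
    """Gini impurity from a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = [7, 5]                  # class counts before the split
children = [[6, 1], [1, 4]]      # class counts of each child after the split
n = sum(parent)

P = gini_counts(parent)                                    # impurity before
M = sum(sum(ch) / n * gini_counts(ch) for ch in children)  # weighted impurity after
print(P, M, P - M)   # gain = P - M; pick the split that maximizes it
```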
Branching
The split depends on the attribute type: nominal, ordinal, or continuous.
It also depends on the number of branches: 2-way (binary) split or multi-way split.
1) Nominal attributes: a multi-way split uses one partition per distinct value; a binary split groups the values into two subsets.
2) Ordinal attributes: the split must not violate the ordering of the values.
3) Continuous attributes: choose an appropriate split point.
Choosing a split point for a continuous attribute takes four steps:
- Sort the records on the attribute values
- Generate the candidate split positions
- Linearly scan these values, each time updating the count matrix and computing the Gini index
- Choose the split position with the lowest Gini index
Example:
1) Sort the values
2) Generate the candidate split points
3) Compute the corresponding Gini values
4) Choose the best (or locally optimal) split; see the sketch below
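A runnable sketch of this scan for one continuous attribute (the income values and labels are toy data; for brevity the child impurities are recomputed at each cut instead of incrementally updating a count matrix):

```python
import numpy as np

def best_continuous_split(values, labels):
    """Sort the attribute, try midpoints between adjacent distinct values,
    and return the cut with the lowest weighted Gini."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(v)

    def gini(lab):
        _, counts = np.unique(lab, return_counts=True)
        return 1.0 - np.sum((counts / len(lab)) ** 2)

    best_cut, best_m = None, float("inf")
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue                              # no valid cut between equal values
        m = i / n * gini(y[:i]) + (n - i) / n * gini(y[i:])
        if m < best_m:
            best_cut, best_m = (v[i] + v[i - 1]) / 2, m
    return best_cut, best_m

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_continuous_split(income, cheat))  # cut between 95 and 100
```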
Determining When to Stop Splitting
Stop expanding a node when all of its records belong to the same class.
Early termination: stop when the number of records falls below a minimum threshold.
Model Overfitting
A good classification model must not only fit the training data well; it must also accurately classify records it has never seen before.
Overfitting: when the model is too complex, training error continues to decrease while test error begins to increase.
Underfitting: when the model is too simple, both training and test errors are large.
Causes of overfitting:
1) Noise
2) Insufficient training examples
3) Excessive model complexity
Occam's Razor
Given two models with similar errors, one should prefer the simpler model over the more complex one.
Model complexity should therefore be included when evaluating a model.
Under overfitting, the training error no longer provides a good estimate of how the model will perform on test data, so other methods of estimating the error rate are needed.
Methods for estimating generalization errors
1) Pessimistic error estimate
- $e'(T) = e(T) + N \times 0.5 / N_t$
- $N$: number of leaf nodes
- $N_t$: number of training records
- $e(T)$: training error rate
- $e'(T)$: estimated generalization error
- Example: for a tree with 30 leaf nodes and 10 errors on training data (out of 1000 instances): training error $= 10/1000 = 1\%$; generalization error $= 1\% + 30 \times 0.5/1000 = 2.5\%$
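The same arithmetic as a one-line function (the function and parameter names are mine, not from the source):

```python
def pessimistic_error(train_errors, n_train, n_leaves, penalty=0.5):
    """e'(T) = e(T) + N * 0.5 / N_t, with e(T) taken as the training error rate."""
    return train_errors / n_train + n_leaves * penalty / n_train

print(pessimistic_error(train_errors=10, n_train=1000, n_leaves=30))  # 0.025
```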
2) Minimum Description Length (MDL)
- $Cost(Model, Data) = Cost(Model) + Cost(Data|Model)$
- $Cost(Model)$: the cost of encoding the model (attribute encoding + leaf encoding)
- $Cost(Data|Model)$: the cost of encoding the misclassified records
Model Evaluation
Common evaluation metrics:
1) Confusion matrix
Accuracy: $Accuracy = \frac{a+d}{a+b+c+d} = \frac{TP+TN}{TP+TN+FP+FN}$
where, in the 2×2 confusion matrix, $a = TP$ (true positives), $b = FN$ (false negatives), $c = FP$ (false positives), and $d = TN$ (true negatives).
Limitation: accuracy can be misleading on imbalanced data; a model that always predicts the majority class of a 99%/1% data set achieves 99% accuracy while detecting nothing.
2) Precision, Recall, and F-measure
Precision (exactness): the percentage of tuples the classifier labeled as positive that are actually positive: $Precision = \frac{TP}{TP+FP}$
Recall (completeness): the percentage of positive tuples the classifier labeled as positive: $Recall = \frac{TP}{TP+FN}$
F-measure (F1 or F-score): the harmonic mean of precision and recall: $F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$
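A minimal sketch computing these metrics from raw confusion-matrix counts (the counts in the usage line are assumed):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                           # exactness
    recall = tp / (tp + fn)                              # completeness
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```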
Methods for Performance Evaluation
1) Holdout method
- The given data is randomly partitioned into two independent sets
- Training set (typically 2/3 of the data) for model construction
- Test set (typically 1/3 of the data) for accuracy estimation
2) Random subsampling: a variation of holdout
- Repeat holdout k times; accuracy = the average of the k accuracies obtained
3) Cross-validation (k-fold, where k = 10 is most popular)
- Partition the data into k disjoint subsets
- Repeat k times: train on k-1 partitions, test on the remaining one
- The total error is obtained by summing the errors over the k runs and averaging
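A hedged scikit-learn sketch of 10-fold cross-validation (the synthetic data and the decision-tree classifier are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# 10-fold CV: train on 9 folds, test on the held-out fold, repeated 10 times
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean())  # average accuracy over the 10 runs
```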
Nearest Neighbor Classifier
The idea is to find all training examples that are relatively similar to the attributes of the test instance.
Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute its distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote); see the sketch below
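A minimal NumPy implementation of this procedure with a majority vote (the training points, labels, and k = 3 are toy assumptions; the Euclidean distance used here is defined in the Voronoi subsection below):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                      # indices of the k closest records
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # -> "A"
```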
Choosing the value of k:
- If k is too small, sensitive to noise points
- If k is too large, neighborhood may include points from other classes
Voronoi Diagram
- A Voronoi diagram is a partitioning of a plane into regions based on distance
- These regions are called Voronoi cells
Computing the distance between two points:
- Euclidean distance: $d(p,q) = \sqrt{\sum_i (p_i - q_i)^2}$
- where $p$ and $q$ are two points in the space and $p_i$, $q_i$ are their attribute values
Scaling issues
- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
- Example: height varies from 1.5 m to 1.8 m; weight varies from 45 kg to 150 kg; income varies from $10K to $1M
Naive Bayes Classifier
Bayesian Belief Networks