Table of Contents
- Chapter 1: Introduction to Data Mining with Python
- I. The Data Mining Process
- II. Using Python and IPython Notebook
- III. An Affinity Analysis Example
- 1. Application Scenarios
- 2. Example: Product Recommendation
- 3. Loading the Dataset with NumPy
- 4. Implementing Simple Ranking Rules
- 5. Ranking to Find the Best Rules
- IV. A Simple Classification Example
- 1. Preparing the Dataset
- 2. Implementing the OneR Algorithm
Chapter 1: Introduction to Data Mining with Python
Course contents:
1. An introduction to data mining and its application scenarios
2. Setting up a Python data mining environment
3. An affinity analysis example: recommending products based on purchasing habits
4. A (classic) classification example: predicting a plant's species from its measurements
I. The Data Mining Process
- Create a dataset
- Samples represent real-world objects
- Features describe the samples in the dataset
- Tune the algorithm
II. Using Python and IPython Notebook
Simply install Anaconda, which can be downloaded from https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/
Anaconda ships with many libraries, which saves the trouble of installing them individually; this includes the scikit-learn library needed here.
III. An Affinity Analysis Example
Definition: determining how closely individuals are related based on the similarity between samples.
1. Application Scenarios
- Ad placement
- Product recommendation
- Finding genetically related people
2. Example: Product Recommendation
Up-selling: selling an additional product to a customer who has already bought something.
Product recommendation service: two items that people often bought together in the past are likely to be bought together in the future.
Turning this into an algorithm:
After a customer buys a product, before making a recommendation, query the historical transaction data, find past transactions that contain the same product, see what else was bought in them, and recommend those items to the customer.
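To make the lookup idea concrete, here is a minimal sketch on a hypothetical transaction history (the item names and transactions below are made up, not the chapter's dataset):

```python
from collections import Counter

# Hypothetical transaction history (made-up items):
# each transaction is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "cheese", "milk"},
    {"apples", "bananas"},
    {"bread", "apples"},
]

def co_purchases(item, transactions):
    """Count what else was bought in transactions that contain `item`."""
    counts = Counter()
    for t in transactions:
        if item in t:
            counts.update(t - {item})  # everything bought alongside `item`
    return counts

# "bread" appears in three transactions; milk was bought alongside it twice
print(co_purchases("bread", transactions).most_common())
```

Recommending the items with the highest co-purchase counts is exactly what the rule mining below formalizes with support and confidence.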
3. Loading the Dataset with NumPy
# coding:utf8
import numpy as np
dataset_filename = "affinity_dataset.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape  # n_features is used by the rule-mining code below
print(X[:5])
4. Implementing Simple Ranking Rules
Measuring how good a rule is:
- Support: the number of times the rule applies in the dataset; it measures how often a given rule occurs.
- Confidence: the accuracy of the rule, i.e. of all samples that satisfy the rule's premise, the proportion for which its conclusion also holds.
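As a quick worked example of the two measures (the counts here are made-up toy numbers):

```python
# Made-up toy counts: 30 transactions contain bread,
# and 15 of those also contain milk.
premise_count = 30   # transactions where the premise (bread) was bought
both_count = 15      # transactions where bread AND milk were bought

support = both_count                     # the rule "bread -> milk" applies 15 times
confidence = both_count / premise_count  # 15 / 30 = 0.5
print(support, confidence)               # 15 0.5
```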
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]
from collections import defaultdict
# Now compute for all possible rules
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
# Process each sample and each of its feature values in turn
for sample in X:
    for premise in range(n_features):
        # Skip if the premise item was not bought
        if sample[premise] == 0:
            continue
        # The premise item was bought: increment num_occurences
        num_occurences[premise] += 1
        # Check the rules
        for conclusion in range(n_features):
            # Skip the same item; it makes little sense to measure X -> X
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
            else:
                # This person bought the premise, but not the conclusion
                invalid_rules[(premise, conclusion)] += 1
# Support: the number of times a rule applies
support = valid_rules
# Confidence: computed by iterating over every rule
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    # Cast to float so the division yields a fraction
    # (only needed in Python 2; Python 3's / is already true division)
    confidence[(premise, conclusion)] = float(valid_rules[(premise, conclusion)]) / num_occurences[premise]
"""
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
"""
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0} they will also buy {1}".format(premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
# People who bought milk and also bought apples
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)
5. Ranking to Find the Best Rules
Rank the rules by support and confidence:
- 1. Find the rules with the highest support
- 1.1 First sort the support dictionary
- 1.2 After sorting, print the top five rules by support
from operator import itemgetter
# items() returns a list of all of the dictionary's (key, value) pairs
# itemgetter(1) uses each element's value as the sort key
# reverse=True sorts in descending order
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)
- 2. Find the rules with the highest confidence
- 2.1 First sort the confidence dictionary
- 2.2 After sorting, print the top five rules by confidence
# items() returns a list of all of the dictionary's (key, value) pairs
# itemgetter(1) uses each element's value as the sort key
# reverse=True sorts in descending order
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)
IV. A Simple Classification Example
1. Preparing the Dataset
import numpy as np
# Load our dataset
from sklearn.datasets import load_iris
#X, y = np.loadtxt("X_classification.txt"), np.loadtxt("y_classification.txt")
dataset = load_iris()
X = dataset.data
y = dataset.target
print(dataset.DESCR)
n_samples, n_features = X.shape
# The features are sepal and petal measurements
Converting continuous data into categorical data:
# Compute the mean for each attribute (discretization threshold)
attribute_means = X.mean(axis=0)
# A Python assert declares that its condition must be true; if the expression is false, an AssertionError is raised
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')
attribute_means
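To see what the `X >= attribute_means` comparison does, here is a small check on a made-up array: the per-column means are broadcast across the rows, yielding a 0/1 matrix.

```python
import numpy as np

# Made-up 3x2 array: the column means are [3.0, 20.0]
X_toy = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 30.0]])
means = X_toy.mean(axis=0)
# The 1-D means vector is broadcast across the rows of the 2-D array,
# so each value is compared against its own column's mean.
X_toy_d = np.array(X_toy >= means, dtype='int')
print(X_toy_d)  # rows: [0 0], [1 1], [1 1]
```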
2. Implementing the OneR Algorithm
Classify a sample according to the class that individuals with the same feature value most likely belong to in the existing data.
Only the one feature (out of the four) that classifies best is used as the basis for classification.
- The algorithm first iterates over every value of every feature
- For each feature value, it counts how often the value appears in each class
- It finds the class in which the value appears most often
- It also counts how often the value appears in the other classes
- It computes each feature's error rate (by summing those counts)
- It picks the feature with the lowest error rate as the OneR
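The steps above can be sketched end-to-end on a tiny made-up binary dataset:

```python
import numpy as np
from collections import defaultdict

# Made-up binary dataset: 2 features, 6 samples, 2 classes.
X_mini = np.array([[0, 1],
                   [0, 1],
                   [0, 0],
                   [1, 0],
                   [1, 0],
                   [1, 1]])
y_mini = np.array([0, 0, 0, 1, 1, 1])

def one_r(X, y):
    best_feature, best_error, best_rule = None, None, None
    for feature in range(X.shape[1]):          # 1. iterate over every feature
        rule, error = {}, 0
        for value in set(X[:, feature]):       # ... and every value it takes
            # 2. count how often this value appears in each class
            counts = defaultdict(int)
            for sample, label in zip(X, y):
                if sample[feature] == value:
                    counts[label] += 1
            # 3. the most frequent class becomes the prediction for this value
            majority = max(counts, key=counts.get)
            rule[value] = majority
            # 4. appearances in any other class count as errors
            error += sum(c for cls, c in counts.items() if cls != majority)
        # 5./6. keep the feature with the lowest total error
        if best_error is None or error < best_error:
            best_feature, best_error, best_rule = feature, error, rule
    return best_feature, best_error, best_rule

# Feature 0 separates the two classes perfectly here, so it is chosen with error 0.
print(one_r(X_mini, y_mini))
```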
A rough attempt of my own; it does not implement all of the functionality:
from collections import defaultdict
n_1 = defaultdict(int)
for i in range(n_samples):
    for j in range(n_features):
        if X_d[i, j] == 0:
            n_1[(j, 0)] += 1
        else:
            n_1[(j, 1)] += 1
Reference code:
from collections import defaultdict
from operator import itemgetter
# 1. The algorithm iterates over every value of every feature, counting how often it appears in each class
def train_feature_value(X, y_true, feature, value):
    class_counts = defaultdict(int)
    # zip() pairs up corresponding elements of its iterable arguments into tuples
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # 2. Sort class_counts to find the class in which individuals with this feature value appear most often
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # 3. Compute the error of this rule: every occurrence outside the most frequent class
    error = sum([class_count for class_value, class_count in class_counts.items()
                 if class_value != most_frequent_class])
    # 4. Return the predicted class and the error
    return most_frequent_class, error
"""Iterate over each value a feature takes, calling the function above; this gives the error for each feature value and hence the feature's total error"""
def train(X, y_true, feature):
    # Check that the feature index is a valid number
    n_samples, n_features = X.shape
    # n_samples = 150, n_features = 4
    assert 0 <= feature < n_features
    # 1. Find the distinct values this feature takes; set() creates an unordered collection of unique elements
    values = set(X[:, feature])
    # Stores the predictors array that is returned
    predictors = dict()
    errors = []
    # 2. For each distinct value of the chosen feature, call the function above to get
    # the most frequent class for that value and its error, storing them in predictors and errors
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # 3. Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error
"""测试算法"""
# Now, we split into a training and test set
from sklearn.cross_validation import train_test_split
# Set the random state to the same number to get the same results as in the book
random_state = 14
X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
#计算所有特征值的预测器
# Compute all of the predictors
all_predictors = {variable: train(X_train, y_train, variable) for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Now choose the best and save that as "model"
# Sort by error
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))
# Choose the bset model
model = {'variable': best_variable,
'predictor': all_predictors[best_variable][0]}
print(model)
"""利用模型,根据特征值对新数据进行分类"""
def predict(X_test, model):
variable = model['variable']
predictor = model['predictor']
y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
return y_predicted
"""预测"""
y_predicted = predict(X_test, model)
print(y_predicted)
#比较结果与实际类别
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))