一、Introduction

A random forest is a classifier that trains multiple trees on the samples and combines their predictions.

It randomly selects the features and randomly selects the training data; for a given test sample, the label predicted most often across the trees is taken as the final prediction.

A random forest is in fact a special case of bagging that uses decision trees as the base models. First, the bootstrap method generates m training sets; then, for each training set, a decision tree is built. When a node looks for a feature to split on, it does not search all features for the one that maximizes the criterion (e.g. information gain); instead it draws a random subset of the features, finds the best split among only those, and applies it at the node. Because the random forest builds on bagging, i.e. on the idea of ensembling, it effectively samples both the samples and the features (if you view the training data as a matrix, as is common in practice, it is a process of sampling both rows and columns), which helps avoid overfitting.
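The row-and-column sampling described above can be sketched with the standard library alone; `bootstrap_sample` is a hypothetical helper name used only for this illustration, not part of the code later in this post:

```python
import random

def bootstrap_sample(data, n_features, m):
    """Draw a bootstrap sample of the rows and a random subset of m feature indices."""
    rows = [random.choice(data) for _ in range(len(data))]  # row sampling with replacement
    cols = random.sample(range(n_features), m)              # column sampling without replacement
    return rows, cols

data = [[0, 1, 0], [1, 0, 1], [1, 1, 1], [0, 0, 0]]
rows, cols = bootstrap_sample(data, n_features=3, m=2)
```

Each tree in the forest would be trained on its own `(rows, cols)` pair, which is exactly the "sample both rows and columns" view of the training matrix.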

The prediction stage follows the bagging strategy: majority vote for classification, averaging for regression.
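Both prediction rules fit in a few lines; this is a minimal sketch, and `vote` / `average` are hypothetical helper names:

```python
from collections import Counter

def vote(predictions):
    """Classification: majority vote over the trees' predicted labels."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: mean of the trees' predicted values."""
    return sum(predictions) / len(predictions)
```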

二、Algorithm

  1. Let N be the number of training samples and M the number of features.
  2. Choose the number of input features m used to decide the outcome at a node of a decision tree, where m should be much smaller than M.
  3. Sample from the N training samples N times with replacement to form a training set (i.e. bootstrap sampling), and use the samples that were not drawn to make predictions and estimate the error.
  4. For each node, randomly select m features; the decision at each node of the tree is based on these features. Compute the best split according to these m features.
  5. Each tree is grown to its full depth and is not pruned (pruning may be applied after a normal tree classifier has been built).
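The five steps above can be sketched as a training loop. Here `train_tree` is a placeholder for a real decision-tree learner (the `majority_stub` stand-in below ignores m and just predicts the majority class, purely for illustration):

```python
import random
from collections import Counter

def train_forest(data, n_trees, m, train_tree):
    """Steps 1-5: for each tree, bootstrap N rows, then grow an unpruned tree
    that considers only a random subset of m features at each split."""
    forest = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in range(len(data))]  # step 3: bootstrap N rows
        forest.append(train_tree(sample, m))                      # steps 4-5: grow the tree
    return forest

# Stand-in learner: always predicts the bootstrap sample's majority label (column 0)
def majority_stub(sample, m):
    label = Counter(row[0] for row in sample).most_common(1)[0][0]
    return lambda row: label

data = [[1, 0, 1], [1, 1, 0], [0, 0, 0], [1, 1, 1]]
forest = train_forest(data, n_trees=5, m=2, train_tree=majority_stub)
```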

三、Advantages

1) For many kinds of data, it can produce a highly accurate classifier;

2) It can handle a large number of input variables;

3) It can estimate the importance of variables when determining the class;

4) While building the forest, it can produce an internal unbiased estimate of the generalization error;

5) It includes a good method for estimating missing data, and it maintains accuracy even when a large proportion of the data is missing;

6) It provides an experimental way to detect variable interactions;

7) For imbalanced classification data sets, it can balance the error;

8) It computes proximities between cases, which is useful for data mining, detecting outliers, and visualizing the data;

9) Using the above, it can be extended to unlabeled data, which is usually handled by unsupervised clustering; it can also detect outliers and visualize the data there;

10) The learning process is fast.

四、Code Implementation

Decision tree construction

See the decision tree algorithm article for reference.



import csv
import numpy as np
import random
import copy
import operator

def loadDataset(filename):
    with open(filename, 'r') as f:
        lines = csv.reader(f)
        data_set = list(lines)
    if filename != 'titanic.csv':
        for i in range(len(data_set)):
            del(data_set[i][0])
    # Clean up the data: drop unused columns and merge sibsp with parch
    for i in range(len(data_set)):
        del(data_set[i][0])
        del(data_set[i][2])
        data_set[i][4] += data_set[i][5]
        del(data_set[i][5])
        del(data_set[i][5])
        del(data_set[i][6])
        del(data_set[i][-1])

    category = data_set[0]

    del (data_set[0])
    # Convert the data formats
    for data in data_set:
        data[0] = int(data[0])
        data[1] = int(data[1])
        if data[3] != '':
            data[3] = float(data[3])
        else:
            data[3] = None
        data[4] = float(data[4])
        data[5] = float(data[5])
    # Fill in missing values, recode, and discretize
    for data in data_set:
        if data[3] is None:
            data[3] = 28
        # male : 1, female : 0
        if data[2] == 'male':
            data[2] = 1
        else:
            data[2] = 0
        # Binarize age: 0 if age < 60, else 1
        # (testing showed that a cutoff of 60 gives the highest accuracy)
        if data[3] < 60:
            data[3] = 0
        else:
            data[3] = 1
        # sibsp + parch: 0 if below 2, otherwise 1
        if data[4] < 2:
            data[4] = 0
        else:
            data[4] = 1
        # fare: threshold at 64
        if data[-1] < 64:
            data[-1] = 0
        else:
            data[-1] = 1
    return data_set, category


def gini(data, i):
    """Weighted Gini impurity of splitting on feature i (column 0 holds the binary label)."""
    num = len(data)
    label_counts = [0, 0, 0, 0]

    p_count = [0, 0, 0, 0]

    gini_count = [0, 0, 0, 0]

    for d in data:
        label_counts[d[i]] += 1

    for l in range(len(label_counts)):
        for d in data:
            if label_counts[l] != 0 and d[0] == 1 and d[i] == l:
                p_count[l] += 1

    print(label_counts)
    print(p_count)

    for l in range(len(label_counts)):
        if label_counts[l] != 0:
            gini_count[l] = 2*(p_count[l]/label_counts[l])*(1 - p_count[l]/label_counts[l])

    gini_p = 0
    for l in range(len(gini_count)):
        gini_p += (label_counts[l]/num)*gini_count[l]

    print(gini_p)

    return gini_p


def get_best_feature(data, category):
    if len(category) == 2:
        return 1, category[1]

    feature_num = len(category) - 1
    data_num = len(data)

    feature_gini = []

    for i in range(1, feature_num+1):
        feature_gini.append(gini(data, i))

    best = 0

    for i in range(len(feature_gini)):
        if feature_gini[i] < feature_gini[best]:
            best = i

    print(feature_gini)
    print(category)
    print(best + 1)
    print(category[best + 1])

    return best + 1, category[best + 1]


def majority_cnt(class_list):
    class_count = {}
    # Count occurrences of each element in class_list
    for vote in class_list:
        if vote not in class_count:
            class_count[vote] = 0
        class_count[vote] += 1
    # Sort by count in descending order (moved out of the loop: one sort is enough)
    sorted_class_count = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]


class Node(object):
    def __init__(self, item):
        self.name = item
        self.lchild = None
        self.rchild = None


def creat_tree(data, labels, feature_labels=None):
    # Avoid the mutable-default-argument pitfall: fresh list per top-level call
    if feature_labels is None:
        feature_labels = []
    # Three stopping conditions
    # Extract the class labels (survived or died)
    class_list = [example[0] for example in data]

    if class_list == []:
        return Node(0)
    # Stop when all samples share the same class
    if class_list.count(class_list[0]) == len(class_list):
        return Node(class_list[0])
    # When all features have been used, return the most frequent class label
    if len(data[0]) == 1:
        return Node(majority_cnt(class_list))

    # Index and label of the best feature
    best_feature_num, best_feature_label = get_best_feature(data, labels)

    feature_labels.append(best_feature_label)

    node = Node(best_feature_label)

    ldata = []
    rdata = []

    for d in data:
        if d[best_feature_num] == 1:
            del(d[best_feature_num])
            ldata.append(d)
        else:
            del(d[best_feature_num])
            rdata.append(d)

    labels2 = copy.deepcopy(labels)
    del(labels2[best_feature_num])

    tree = node
    tree.lchild = creat_tree(ldata, labels2, feature_labels)
    tree.rchild = creat_tree(rdata, labels2, feature_labels)

    return tree


def breadth_travel(tree):
    """广度遍历"""
    queue = [tree]
    while queue:
        cur_node = queue.pop(0)
        print(cur_node.name, end=" ")
        if cur_node.lchild is not None:
            queue.append(cur_node.lchild)
        if cur_node.rchild is not None:
            queue.append(cur_node.rchild)
    print()


def prediction(t_tree, test, labels):
    result = []

    for data in test:
        l = copy.deepcopy(labels)
        tree = t_tree
        for i in range(len(labels)):
            if tree.name == 1 or tree.name == 0:
                result.append(tree.name)
                break
            j = 1
            while j:
                if tree.name == l[j]:
                    break
                j += 1

            if data[j] == 1:
                tree = tree.lchild
            else:
                tree = tree.rchild
            del(l[j])
            del(data[j])
    return result



Prediction code for the random forest



def new_pre(t_test, labels, tree):
    result = []
    r = []

    for i in range(len(t_test)):
        label = copy.deepcopy(labels[i])
        print(label)
        breadth_travel(tree[i])
        r.append(prediction(tree[i], t_test[i], label))
    rr = []
    for i in range(len(r[0])):
        rr.append([])

    for i in range(len(rr)):
        for j in range(len(r)):
            rr[i].append(r[j][i])

    print(rr)

    for i in range(len(rr)):
        result.append(majority_cnt(rr[i]))
    return result



Load the data



test_set, category = loadDataset('titanic_test.csv')
data_set, category = loadDataset('titanic.csv')



Generate the forest

Randomly select three features per tree and build ten trees.



tree_num = 10
bootstrapping = []
b_category = []
b_test = []
for i in range(tree_num):
    b_category.append(copy.deepcopy(category))
    b_test.append(copy.deepcopy(test_set))
    bootstrapping.append([])
    for j in range(len(data_set)):
        bootstrapping[i].append(copy.deepcopy(data_set[int(np.floor(np.random.random() * len(data_set)))]))

print(test_set)
print(b_test)
# m = 3: randomly pick the two features to drop for each tree
n = 2
n_num_category = []
for i in range(tree_num):
    n_num_category.append(random.sample(range(1, 5), n))

print(n_num_category)

print(b_category)

for i in range(len(b_category)):
    for j in range(n):
        b_category[i][n_num_category[i][j]] = 0
    for j in range(n):
        b_category[i].remove(0)

    for k in range(len(b_test[i])):
        for j in range(2):
            b_test[i][k][n_num_category[i][j]] = -1
    for k in range(len(b_test[i])):
        for j in range(2):
            b_test[i][k].remove(-1)

    for k in range(len(bootstrapping[i])):
        for j in range(2):
            bootstrapping[i][k][n_num_category[i][j]] = -1
    for k in range(len(bootstrapping[i])):
        for j in range(2):
            bootstrapping[i][k].remove(-1)

print(b_category)
print(b_test)
print(bootstrapping)

b2_category = copy.deepcopy(b_category)

my_tree = []

for i in range(tree_num):
    my_tree.append(creat_tree(bootstrapping[i], b2_category[i]))

for i in range(tree_num):
    print(b_category[i])
    breadth_travel(my_tree[i])



Compute the accuracy



result = new_pre(b_test, b_category, my_tree)

print(result)

counts = 0

for i in range(len(test_set)):
    if test_set[i][0] == result[i]:
        counts += 1

print(counts)

print(counts / len(test_set))



五、Results and Analysis

Accuracy: 63.7%

This is lower than the results of other methods. A likely cause is that the data set has few features to begin with, so the decision trees built after randomly dropping features perform poorly, and the ensemble's performance suffers as a result.