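Many posts online implement spam classification with the naive Bayes algorithm, some by calling a library directly and some written from scratch. As a way of reviewing Bayesian methods, I also wanted to write an email classifier with naive Bayes and build a simple spam detection system.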

1. Development environment

Python 3.6 and an email corpus (25 ham emails and 25 spam emails).

2. A brief introduction to Bayes' theorem

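We have a training set and count how often each word occurs in it. An email is represented by its words (w1, w2, w3, ..., wn), and we use this word vector to estimate the probability that the email is spam, i.e. we want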

P(s|w),w=(w1,w2,...,wn)

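In words: given that the words wi occur in the email, how likely is it to be spam? Here s denotes the event that the email is spam and s′ the event that it is ham.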

By Bayes' theorem and the law of total probability,

P(s|w1,w2,...,wn)
= P(s,w1,w2,...,wn) / P(w1,w2,...,wn)
= P(w1,w2,...,wn|s)⋅P(s) / P(w1,w2,...,wn)
= P(w1,w2,...,wn|s)⋅P(s) / ( P(w1,w2,...,wn|s)⋅P(s) + P(w1,w2,...,wn|s′)⋅P(s′) )

Under the naive Bayes conditional independence assumption, and with the priors set to P(s)=P(s′)=0.5, this simplifies to:

P(s|w) = ∏j P(wj|s) / ( ∏j P(wj|s) + ∏j P(wj|s′) )

(Equation 1)


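Applying Bayes' theorem once more, P(wj|s)=P(s|wj)⋅P(wj)/P(s) (and likewise for s′), the common factors P(wj) and the equal priors cancel, and the expression becomes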

P(s|w) = ∏j P(s|wj) / ( ∏j P(s|wj) + ∏j P(s′|wj) ), where P(s′|wj) = 1 − P(s|wj)

(Equation 2)


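This gives Equation 2, which is the formula used to compute P(s|w). Equation 1 is not used because the terms conditioned on s′ are awkward to compute; Equation 2 only needs the per-word probabilities P(s|wj), so the product is easy to evaluate.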
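To make Equation 2 concrete, here is a minimal sketch that combines per-word probabilities P(s|wj) into a single spam score. The function name `combine_word_probs` is hypothetical (it is not part of the code in this post), the input values are made up for illustration, and equal priors are assumed as stated above.

```python
def combine_word_probs(word_probs):
    """Combine per-word P(s|w_j) values into P(s|w) using Equation 2.

    Assumes equal priors P(s) = P(s') = 0.5 and conditional independence.
    """
    spam_product = 1.0  # product of P(s|w_j)
    ham_product = 1.0   # product of P(s'|w_j) = 1 - P(s|w_j)
    for p in word_probs:
        spam_product *= p
        ham_product *= (1.0 - p)
    return spam_product / (spam_product + ham_product)


# Made-up per-word probabilities, for illustration only.
score = combine_word_probs([0.9, 0.8, 0.3])
print(round(score, 4))  # 0.9391
```

Two strongly "spammy" words outweigh one mildly "hammy" word here, which is the behaviour the product form is meant to give.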

3. Algorithm

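(1) Tokenize the training set and filter it with a stop list (a manually created set of illegal characters) to obtain the clean_word list;
(2) For ham and spam separately, record for each word the number of emails that contain it, giving two dictionaries. For example, the word "price" appears in 2 of the 25 ham emails and in 15 of the 25 spam emails;
(3) Process each email in the test set the same way and compute P(s|w) for each of its words. If a word appears only in the spam dictionary, set P(w|s′)=0.01, and vice versa; if it appears in neither, set P(s|w)=0.4. Note: these values follow assumptions taken from earlier work on the subject.
(4) For each email, combine the per-word probabilities with Equation 2. If the result exceeds the threshold α (set to 0.5 here), classify the email as spam; otherwise classify it as ham.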
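Steps (2) and (3) can be sketched as follows. This is an illustration under assumptions, not the implementation given later: the helper name `word_spam_prob` and the dictionaries are hypothetical, and only the "price" counts (2 of 25 ham, 15 of 25 spam) come from the example above; the other counts are made up.

```python
def word_spam_prob(word, ham_counts, spam_counts, n_ham, n_spam):
    """Return P(s|w) for one word, following step (3) with equal priors."""
    in_ham = ham_counts.get(word, 0)
    in_spam = spam_counts.get(word, 0)
    if in_ham == 0 and in_spam == 0:
        return 0.4  # word never seen in training
    # Fraction of emails containing the word; 0.01 if absent from that class.
    p_w_ham = in_ham / n_ham if in_ham else 0.01     # P(w|s')
    p_w_spam = in_spam / n_spam if in_spam else 0.01  # P(w|s)
    # With P(s) = P(s') = 0.5 the priors cancel.
    return p_w_spam / (p_w_spam + p_w_ham)


# "price" counts are from the text; "meeting" and "pills" are made up.
ham_counts = {"price": 2, "meeting": 5}
spam_counts = {"price": 15, "pills": 10}

p_price = word_spam_prob("price", ham_counts, spam_counts, 25, 25)
p_pills = word_spam_prob("pills", ham_counts, spam_counts, 25, 25)
p_unknown = word_spam_prob("zebra", ham_counts, spam_counts, 25, 25)
print(p_price, p_pills, p_unknown)
```

For "price" this gives 0.6 / (0.6 + 0.08), roughly 0.88, so a single occurrence of the word already leans the email toward spam.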

4. Implementation

# -*- coding: utf-8 -*-
# @Date     : 2018-05-17 09:29:13
# @Author   : lrk
# @Language : Python3.6

# from fwalker import fun
# from reader import readtxt
import os


def readtxt(path,encoding):
    with open(path, 'r', encoding = encoding) as f:
        lines = f.readlines()
    return lines

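def fileWalker(path):
    """Walk the directory tree and return the paths of all files."""
    fileArray = []
    for root, dirs, files in os.walk(path):
        for fn in files:
            # os.path.join uses the correct separator for the current OS
            fileArray.append(os.path.join(root, fn))
    return fileArray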

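def email_parser(email_path):
    """Return the list of cleaned, lower-cased words in one email."""
    punctuations = """,.<>()*&^%$#@!'";~`[]{}|、\\/~+_-=?"""
    content_list = readtxt(email_path, 'utf8')
    content = (' '.join(content_list)).replace('\r\n', ' ').replace('\t', ' ')
    # Replace every punctuation character with a space
    for punctuation in punctuations:
        content = (' '.join(content.split(punctuation))).replace('  ', ' ')
    # Keep words longer than two characters. Lines read in text mode end in
    # '\n', which is not stripped here, so the last word of a line can keep a
    # trailing newline (e.g. 'had\n' in the results below).
    clean_word = [word.lower()
                  for word in content.split(' ') if len(word) > 2]
    return clean_word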


def get_word(email_file):
    """Return (list of word lists, one per email; set of all words seen)."""
    word_list = []
    word_set = set()
    email_paths = fileWalker(email_file)
    for email_path in email_paths:
        clean_word = email_parser(email_path)
        word_list.append(clean_word)
        word_set.update(clean_word)
    return word_list, word_set


def count_word_prob(email_list, union_set):
    """Map each word to the fraction of emails in email_list that contain it.

    A word that appears in none of the emails gets 0.01 instead of 0.
    """
    word_prob = {}
    for word in union_set:
        counter = 0
        for email in email_list:
            if word in email:
                counter += 1  # number of emails containing the word
        if counter != 0:
            prob = counter / len(email_list)
        else:
            prob = 0.01
        word_prob[word] = prob
    return word_prob


def filter(ham_word_pro, spam_word_pro, test_file):
    """Classify every email under test_file and print the result.

    Note: this name shadows Python's built-in filter() within this module.
    """
    test_paths = fileWalker(test_file)
    for test_path in test_paths:
        spam_prob = 0.5  # prior P(s)
        ham_prob = 0.5   # prior P(s')
        file_name = os.path.basename(test_path)
        prob_dict = {}
        words = set(email_parser(test_path))  # distinct words in this test email
        for word in words:
            if word not in spam_word_pro:
                Psw = 0.4  # word not seen in any training email
            else:
                Pws = spam_word_pro[word]  # frequency of the word in spam
                Pwh = ham_word_pro[word]   # frequency of the word in ham
                # per-word Bayes probability P(s|w)
                Psw = spam_prob * (Pws / (Pwh * ham_prob + Pws * spam_prob))
            prob_dict[word] = Psw
        # Equation 2: combine the per-word probabilities
        numerator = 1
        denominator_h = 1
        for k, v in prob_dict.items():
            numerator *= v
            denominator_h *= (1 - v)
        email_spam_prob = round(numerator / (numerator + denominator_h), 4)
        if email_spam_prob > 0.5:
            print(file_name, 'spam', 'psw is', email_spam_prob)
        else:
            print(file_name, 'ham', 'psw is', email_spam_prob)
        print(prob_dict)
        print('-' * 30, 'divider', '-' * 30)

def main():
    ham_file = r'D:\Program Files (x86)\python文件\贝叶斯垃圾邮件分类\email\ham'
    spam_file = r'D:\Program Files (x86)\python文件\贝叶斯垃圾邮件分类\email\spam'
    test_file = r'D:\Program Files (x86)\python文件\贝叶斯垃圾邮件分类\email\test'
    ham_list, ham_set = get_word(ham_file)
    spam_list, spam_set = get_word(spam_file)
    union_set = ham_set | spam_set
    ham_word_pro = count_word_prob(ham_list, union_set)
    spam_word_pro = count_word_prob(spam_list, union_set)
    # print(ham_set)
    filter(ham_word_pro, spam_word_pro, test_file)


if __name__ == '__main__':
    main()
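
One possible weakness of the listing above: in `filter`, `numerator` and `denominator_h` are products of many values below 1, so for an email with a very large number of distinct words both can underflow to 0.0 and the division fails. A common workaround, which is not part of the original code, is to evaluate Equation 2 in log space. The sketch below assumes a hypothetical helper `spam_score_log` and a clamping constant `eps` chosen only for illustration.

```python
import math


def spam_score_log(word_probs, eps=1e-9):
    """Equation 2 evaluated in log space to avoid floating-point underflow."""
    log_spam = 0.0  # log of the product of P(s|w_j)
    log_ham = 0.0   # log of the product of (1 - P(s|w_j))
    for p in word_probs:
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never receives 0
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # num / (num + den) == 1 / (1 + den / num) == 1 / (1 + exp(log_ham - log_spam))
    diff = log_ham - log_spam
    if diff > 700:  # math.exp would overflow; the score is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Same made-up values as a direct product would handle without trouble.
print(round(spam_score_log([0.9, 0.8, 0.3]), 4))
# 2000 words at 0.4 would underflow both direct products to 0.0.
print(spam_score_log([0.4] * 2000))
```

For short emails the result matches the direct product up to rounding, so this could replace the final loop in `filter` without changing the decisions shown below.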

5. Results


ham_24.txt classified as ham psw is 0.0005
{'there': 0.052132701421800945, 'latest': 0.18032786885245902, 'the': 0.15827338129496402, 'will': 0.23404255319148934, '10:00': 0.4}
-----------------------------------------------------------
ham_3.txt classified as ham psw is 0.0
{'john': 0.4, 'with': 0.1692307692307692, 'get': 0.2894736842105263, 'had': 0.4, 'fans': 0.4, 'went': 0.18032786885245902, 'that': 0.1692307692307692, 'giants': 0.4, 'food': 0.4, 'time': 0.3793103448275862, 'stuff': 0.18032786885245902, 'rain': 0.4, 'cold': 0.18032786885245902, 'bike': 0.4, 'they': 0.0990990990990991, 'free': 0.8474576271186441, 'going': 0.06832298136645963, 'what': 0.18032786885245902, 'the': 0.15827338129496402, 'email': 0.06832298136645963, 'all': 0.55, 'some': 0.0990990990990991, 'when': 0.4, 'got': 0.0990990990990991, 'train': 0.4, 'done': 0.18032786885245902, 'computer': 0.4, 'drunk': 0.4, 'museum': 0.4, 'take': 0.18032786885245902, 'game': 0.4, 'yesterday': 0.4, 'and': 0.48458149779735676, 'not': 0.0990990990990991, 'there': 0.052132701421800945, 'same': 0.4, 'talked': 0.4, 'had\n': 0.4, 'we\n': 0.4, 'was': 0.052132701421800945, 'about': 0.06832298136645963, 'riding': 0.4, 'are': 0.03536977491961415}
-----------------------------------------------------------
ham_4.txt classified as ham psw is 0.0
{'having': 0.4, 'like': 0.06832298136645963, 'jquery': 0.4, 'website': 0.4, 'away': 0.4, 'using': 0.18032786885245902, 'working': 0.18032786885245902, 'running': 0.4, 'the': 0.15827338129496402, 'from': 0.14864864864864866, 'would': 0.18032786885245902, 'launch': 0.4, 'you': 0.34374999999999994, 'and': 0.48458149779735676, 'right': 0.18032786885245902, 'not': 0.0990990990990991, 'too': 0.18032786885245902, 'been': 0.18032786885245902, 'plugin': 0.4, 'prototype': 0.4, 'used': 0.9174311926605505, 'far': 0.4, 'jqplot': 0.4, 'think': 0.18032786885245902}
-----------------------------------------------------------
spam_11.txt classified as spam psw is 1.0
{'natural': 0.9652509652509652, 'length': 0.9652509652509652, 'harderecetions\n': 0.9652509652509652, 'proven': 0.8474576271186441, 'works': 0.8474576271186441, 'guaranteeed': 0.8474576271186441, 'betterejacu1ation': 0.9652509652509652, 'that': 0.1692307692307692, 'ofejacu1ate\n': 0.9652509652509652, 'explosive': 0.9652509652509652, 'naturalpenisenhancement': 0.8474576271186441, 'yourpenis': 0.9652509652509652, 'safe\n': 0.8474576271186441, 'thickness': 0.9652509652509652, 'designed': 0.859375, 'the': 0.15827338129496402, 'inches': 0.970873786407767, 'intenseorgasns\n': 0.9652509652509652, 'amazing': 0.9652509652509652, '100': 0.9779951100244498, 'everything': 0.9652509652509652, 'gain': 0.9652509652509652, 'incredib1e': 0.9652509652509652, 'have': 0.6311475409836066, 'control\n': 0.9652509652509652, 'gains': 0.9652509652509652, 'experience': 0.970873786407767, 'you': 0.34374999999999994, 'doctor': 0.9652509652509652, 'and': 0.48458149779735676, 'volume': 0.9652509652509652, 'permanantly\n': 0.9652509652509652, 'rock': 0.9652509652509652, 'moneyback': 0.8474576271186441, 'increase': 0.9652509652509652, 'endorsed\n': 0.9652509652509652, 'herbal': 0.9652509652509652}
-----------------------------------------------------------
spam_14.txt classified as spam psw is 0.9991
{'viagranoprescription': 0.4, 'per': 0.4, 'pill\n': 0.8474576271186441, 'accept': 0.5499999999999999, 'canadian': 0.9174311926605505, '25mg': 0.4, 'from': 0.14864864864864866, '100mg': 0.4, 'pharmacy\n': 0.4, 'here': 0.7096774193548386, 'certified': 0.4, 'amex': 0.4, 'visa': 0.9569377990430622, 'worldwide': 0.4, 'brandviagra': 0.4, 'needed': 0.4, 'femaleviagra': 0.4, 'check': 0.5499999999999999, 'buyviagra': 0.4, 'buy': 0.9779951100244498, 'delivery': 0.8474576271186441, '50mg': 0.4}
-----------------------------------------------------------
spam_17.txt classified as ham psw is 0.0
{'based': 0.4, 'income': 0.4, 'transformed': 0.4, 'finder': 0.4, 'more': 0.4489795918367347, 'great': 0.4, 'your': 0.2588235294117647, 'success': 0.4, 'experts': 0.4, 'financial': 0.4, 'from': 0.14864864864864866, 'life': 0.4, 'let': 0.0990990990990991, 'learn': 0.4, 'earn': 0.4, 'this': 0.0267639902676399, 'knocking': 0.4, 'rude': 0.4, 'work': 0.0990990990990991, 'here': 0.7096774193548386, 'business': 0.4, 'chance': 0.4, 'door': 0.18032786885245902, 'you': 0.34374999999999994, 'can': 0.03536977491961415, 'and': 0.48458149779735676, 'opportunity': 0.4, 'home': 0.4, 'don抰': 0.4, 'find\n': 0.4}
-----------------------------------------------------------
spam_18.txt classified as spam psw is 1.0
{'291': 0.4, '30mg': 0.9174311926605505, 'freeviagra': 0.4, 'codeine': 0.9174311926605505, 'pills': 0.9433962264150942, 'wilson': 0.4, 'net': 0.4, 'price': 0.4, 'the': 0.15827338129496402, '492': 0.4, '120': 0.9569377990430622, 'most': 0.8474576271186441, '396': 0.4, '00\n': 0.9433962264150942, '156': 0.4, 'competitive': 0.4}
-----------------------------------------------------------
spam_19.txt classified as spam psw is 1.0
{'wallets\n': 0.8474576271186441, 'bags\n': 0.8474576271186441, 'get': 0.2894736842105263, 'order\n': 0.8474576271186441, 'courier:': 0.8474576271186441, 'more': 0.4489795918367347, 'vuitton': 0.8474576271186441, 'watchesstore\n': 0.8474576271186441, 'hermes': 0.8474576271186441, 'shipment': 0.8474576271186441, 'brands\n': 0.8474576271186441, 'your': 0.2588235294117647, 'dhl': 0.8474576271186441, 'arolexbvlgari': 0.8474576271186441, 'cartier': 0.8474576271186441, 'fedex': 0.9174311926605505, 'tiffany': 0.8474576271186441, 'oris': 0.8474576271186441, 'recieve': 0.8474576271186441, 'all': 0.55, 'discount': 0.8474576271186441, 'famous': 0.8474576271186441, 'speedpost\n': 0.8474576271186441, 'will': 0.23404255319148934, '100': 0.9779951100244498, 'dior': 0.8474576271186441, 'gucci': 0.8474576271186441, 'reputable': 0.8474576271186441, 'watches:': 0.8474576271186441, 'via': 0.8474576271186441, 'year': 0.5499999999999999, 'off': 0.8474576271186441, 'ems': 0.8474576271186441, 'save': 0.9433962264150942, 'you': 0.34374999999999994, 'and': 0.48458149779735676, 'for': 0.4489795918367347, 'online': 0.859375, 'jewerly\n': 0.8474576271186441, 'warranty\n': 0.8474576271186441, 'enjoy': 0.5499999999999999, 'bags': 0.8474576271186441, 'watches': 0.8474576271186441, 'ups': 0.8474576271186441, 'quality': 0.9569377990430622, 'louis': 0.8474576271186441, 'full': 0.8474576271186441}
-----------------------------------------------------------
spam_22.txt classified as spam psw is 1.0
{'wallets\n': 0.8474576271186441, 'bags\n': 0.8474576271186441, 'get': 0.2894736842105263, 'courier:': 0.8474576271186441, 'more': 0.4489795918367347, 'vuitton': 0.8474576271186441, 'watchesstore\n': 0.8474576271186441, 'hermes': 0.8474576271186441, 'shipment': 0.8474576271186441, 'brands\n': 0.8474576271186441, 'your': 0.2588235294117647, 'dhl': 0.8474576271186441, 'arolexbvlgari': 0.8474576271186441, 'cartier': 0.8474576271186441, 'fedex': 0.9174311926605505, 'tiffany': 0.8474576271186441, 'oris': 0.8474576271186441, 'recieve': 0.8474576271186441, 'all': 0.55, 'discount': 0.8474576271186441, 'famous': 0.8474576271186441, 'speedpost\n': 0.8474576271186441, 'will': 0.23404255319148934, '100': 0.9779951100244498, 'dior': 0.8474576271186441, 'gucci': 0.8474576271186441, 'reputable': 0.8474576271186441, 'watches:': 0.8474576271186441, 'via': 0.8474576271186441, 'year': 0.5499999999999999, 'off': 0.8474576271186441, 'ems': 0.8474576271186441, 'you': 0.34374999999999994, 'and': 0.48458149779735676, 'order': 0.9433962264150942, 'for': 0.4489795918367347, 'online': 0.859375, 'jewerly\n': 0.8474576271186441, 'warranty\n': 0.8474576271186441, 'enjoy': 0.5499999999999999, 'bags': 0.8474576271186441, 'watches': 0.8474576271186441, 'ups': 0.8474576271186441, 'louis': 0.8474576271186441, 'full': 0.8474576271186441}
-----------------------------------------------------------
spam_8.txt classified as spam psw is 1.0
{'natural': 0.9652509652509652, 'length': 0.9652509652509652, 'harderecetions\n': 0.9652509652509652, 'betterejacu1ation': 0.9652509652509652, 'ofejacu1ate\n': 0.9652509652509652, 'explosive': 0.9652509652509652, 'safe': 0.970873786407767, 'yourpenis': 0.9652509652509652, 'thickness': 0.9652509652509652, 'designed': 0.859375, 'inches': 0.970873786407767, 'intenseorgasns\n': 0.9652509652509652, 'amazing': 0.9652509652509652, '100': 0.9779951100244498, 'everything': 0.9652509652509652, 'gain': 0.9652509652509652, 'incredib1e': 0.9652509652509652, 'have': 0.6311475409836066, 'control\n': 0.9652509652509652, 'gains': 0.9652509652509652, 'experience': 0.970873786407767, 'you': 0.34374999999999994, 'doctor': 0.9652509652509652, 'and': 0.48458149779735676, 'volume': 0.9652509652509652, 'permanantly\n': 0.9652509652509652, 'rock': 0.9652509652509652, 'increase': 0.9652509652509652, 'endorsed\n': 0.9652509652509652, 'herbal': 0.9652509652509652}
-----------------------------------------------------------
 

6. Analysis

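The results show that 9 of the 10 test emails are classified correctly; spam_17.txt is misclassified as ham (psw 0.0). The sample set used here is also very small, and most of the test emails also appear in the training set. To make the classifier more general, the data set should be enlarged and made more varied.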
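
One way to check the classification results is to compare each predicted label with the label implied by the file name. The sketch below assumes that the `ham_` / `spam_` prefix is the ground truth (an assumption, since the post does not state it explicitly); the helper `true_label` is hypothetical, and the predictions are copied from the output in the results section.

```python
def true_label(file_name):
    """Infer the ground-truth label from the file-name prefix (assumption)."""
    return file_name.split('_')[0]


# (file name, predicted label) pairs copied from the results section.
predictions = [
    ("ham_24.txt", "ham"), ("ham_3.txt", "ham"), ("ham_4.txt", "ham"),
    ("spam_11.txt", "spam"), ("spam_14.txt", "spam"), ("spam_17.txt", "ham"),
    ("spam_18.txt", "spam"), ("spam_19.txt", "spam"), ("spam_22.txt", "spam"),
    ("spam_8.txt", "spam"),
]

errors = [name for name, pred in predictions if true_label(name) != pred]
accuracy = 1 - len(errors) / len(predictions)
print(f"accuracy = {accuracy:.2f}, misclassified: {errors}")
```

With only 10 test emails a single error moves the accuracy by 10 percentage points, which is another reason to evaluate on a larger held-out set.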