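The naive Bayes algorithm learns the joint probability distribution of inputs and outputs under the assumption that the features are conditionally independent given the class. For a given input, the model then applies Bayes' theorem and outputs the label y with the largest posterior probability. The principle is as follows: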

                                P(Y=Ck | X=x) = P(X=x|Y=Ck)*P(Y=Ck)/P(X=x)

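In other words, this is the probability that the class is Ck when the features are X = x. When Y takes two values (0 and 1), we compare P(Y=0 | X=x) with P(Y=1 | X=x) and assign x to whichever class has the larger probability. The denominator P(X=x) is the same for both classes, so comparing the numerators is enough.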
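Written out with the conditional-independence assumption, the decision rule looks roughly like this. It is a sketch in standard notation: the symbols x^{(j)} (the j-th feature of x) and n (the number of features) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
P(Y=C_k \mid X=x) &= \frac{P(X=x \mid Y=C_k)\,P(Y=C_k)}{P(X=x)} \\
P(X=x \mid Y=C_k) &= \prod_{j=1}^{n} P\bigl(X^{(j)}=x^{(j)} \mid Y=C_k\bigr) \\
\hat{y} &= \arg\max_{C_k}\; P(Y=C_k)\prod_{j=1}^{n} P\bigl(X^{(j)}=x^{(j)} \mid Y=C_k\bigr)
\end{aligned}
```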

# -*-coding:utf-8-*-
import jieba
goods = ["我是个中国人,我很自豪", "我相信你也是"]
bads = ["招商银行信用卡", "来就送", "夏令营"]
test = ["夏令营"]

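Here we assume two classes, normal text (goods) and unwanted text (bads), each with its own sample data. test is the sample to classify. We compare the probability that test belongs to goods with the probability that it belongs to bads, and naive Bayes assigns test to whichever class scores higher.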

# Collect the distinct words in the sentences and count every token
def get_words(sentences):
    words = []
    count = 0
    for sentence in sentences:
        tmp_sentence = jieba.cut(sentence)
        for item in tmp_sentence:
            count = count + 1
            if item not in words:
                words.append(item)
    return [words, count]

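get_words splits goods and bads into words and returns, for each class, the list of distinct words and the total number of tokens. The probability of each word within its class has not been computed yet. Calling it on both classes gives: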

good_result = get_words(goods)
good_words = good_result[0]
good_count = good_result[1]
bad_result = get_words(bads)
bad_words = bad_result[0]
bad_count = bad_result[1]

Next, compute the probability of each word within bads.

# Prior probability: 0.6 for spam (bads), 0.4 for normal mail (goods)
# Relative frequency of each word within the spam samples
bad_word_freq = {}
for sentence in bads:
    tmp_sentence = jieba.cut(sentence)
    for item in tmp_sentence:
        if item in bad_word_freq:
            bad_word_freq[item] = bad_word_freq[item] + 1
        else:
            bad_word_freq[item] = 1
for item in bad_word_freq.keys():
    bad_word_freq[item] = bad_word_freq[item] / bad_count
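
The priors of 0.6 for bads and 0.4 for goods are given as fixed numbers. The post does not say how they were obtained, but one common choice is the share of training samples in each class, which gives the same values for this data (3 of 5 and 2 of 5). A minimal sketch, where `class_priors` is a hypothetical helper name that is not part of the original code:

```python
# Same sample lists as in the post.
goods = ["我是个中国人,我很自豪", "我相信你也是"]
bads = ["招商银行信用卡", "来就送", "夏令营"]


def class_priors(samples_by_class):
    """Return {class_name: share of all training samples in that class}."""
    total = sum(len(samples) for samples in samples_by_class.values())
    return {name: len(samples) / total
            for name, samples in samples_by_class.items()}


priors = class_priors({"goods": goods, "bads": bads})
print(priors)
```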

Then compute the probability of each word within goods.

# Relative frequency of each word within the normal-mail samples
good_word_freq = {}
for sentence in goods:
    tmp_words = jieba.cut(sentence)
    for item in tmp_words:
        if item not in good_word_freq.keys():
            good_word_freq[item] = 1
        else:
            good_word_freq[item] = good_word_freq[item] + 1
for item in good_word_freq.keys():
    good_word_freq[item] = good_word_freq[item] / good_count
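
The two frequency loops above do the same work for each class, so they could be folded into one helper. A minimal sketch using `collections.Counter`: `word_frequencies` is a hypothetical name, and the input is assumed to be already tokenized (for example with `jieba.cut`) so that the example runs without jieba. The token lists are hand-written for illustration and are not jieba's actual output.

```python
from collections import Counter


def word_frequencies(tokenized_sentences):
    """Map each word to its share of all tokens in the given sentences."""
    counts = Counter(word
                     for sentence in tokenized_sentences
                     for word in sentence)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hand-written token lists, for illustration only.
bad_tokens = [["信用卡", "优惠"], ["来", "就", "送"], ["夏令营"]]
bad_word_freq = word_frequencies(bad_tokens)
print(bad_word_freq["夏令营"])  # 1 of the 6 tokens
```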

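For each word in test, take its probability within bads, multiply these probabilities together, and multiply the result by the prior probability of bads.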

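# Score for test being spam: product of word frequencies times the spam prior
bad_lv = 1
for sentence in test:
    tmp_words = jieba.cut(sentence)
    for item in tmp_words:
        # Words that never occur in bads are skipped
        if item in bad_word_freq.keys():
            bad_lv = bad_lv * bad_word_freq[item]
print("Spam probability: " + str(bad_lv * 0.6))

Do the same for goods: multiply together the probabilities within goods of the words in test, then multiply by the prior probability of goods.

# Score for test being normal mail: product of word frequencies times the normal-mail prior
good_lv = 1
for sentence in test:
    tmp_words = jieba.cut(sentence)
    for item in tmp_words:
        # Words that never occur in goods are skipped
        if item in good_word_freq.keys():
            good_lv = good_lv * good_word_freq[item]
print("Normal mail probability: " + str(good_lv * 0.4))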

Compare the two printed values: test is assigned to the class whose value is larger.
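
One caveat with the scoring loops: a test word that never occurs in a class is skipped, so that class's product stays at 1 and its score can come out higher than it should. A common remedy is add-one (Laplace) smoothing, usually combined with summing logarithms instead of multiplying probabilities. This is a swapped-in technique, not the method used in the post. The sketch assumes pre-tokenized input, and the helper names (`train`, `log_score`) and the token lists are hypothetical.

```python
import math
from collections import Counter


def train(tokenized_by_class):
    """Count words per class, collect the shared vocabulary, estimate priors."""
    counts = {name: Counter(w for sent in sents for w in sent)
              for name, sents in tokenized_by_class.items()}
    vocab = set().union(*counts.values())
    n_samples = sum(len(sents) for sents in tokenized_by_class.values())
    priors = {name: len(sents) / n_samples
              for name, sents in tokenized_by_class.items()}
    return counts, vocab, priors


def log_score(words, name, counts, vocab, priors):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    total = sum(counts[name].values())
    score = math.log(priors[name])
    for w in words:
        # An unseen word contributes (0 + 1) / (total + |V|) instead of being skipped.
        score += math.log((counts[name][w] + 1) / (total + len(vocab)))
    return score


# Hand-written token lists, for illustration only (not jieba's actual output).
data = {
    "goods": [["我", "是", "中国人"], ["我", "相信", "你"]],
    "bads": [["信用卡"], ["来", "就", "送"], ["夏令营"]],
}
counts, vocab, priors = train(data)
test_words = ["夏令营"]
scores = {name: log_score(test_words, name, counts, vocab, priors)
          for name in data}
predicted = max(scores, key=scores.get)
print(predicted)  # prints "bads"
```

With smoothing, the word that is missing from goods lowers the goods score instead of leaving it untouched, so the comparison between the two classes is made on equal terms.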