HRV feature extraction in Python: feature extraction with Python

Repost

mob6454cc623087 2024-05-31 10:20:41


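Feature extraction, simply put, converts raw data into numeric features that machine learning algorithms can use. sklearn.feature_extraction is the scikit-learn module for feature extraction.

This article covers the following topics:

One-hot encoding
Using DictVectorizer
Using CountVectorizer
Using TfidfVectorizer
Using HashingVectorizer

1.One-hot encoding

As noted above, features have to be converted into numeric form for machine learning. For categorical data, this usually means one-hot encoding.

Why convert to one-hot encoding? Consider the data below.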

[Image: table of sample Bluetooth headset listings, including a text-valued "类型" (type) column]

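This is a sample of Bluetooth headset data from the 1688 website. In the table above, the "类型" (type) column holds strings, which a model cannot consume directly, so it has to be converted to numbers.

What does one-hot encoding look like?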

[Image: the same table after one-hot encoding, with one 0/1 column per value of "类型"]

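Calling the pandas function get_dummies() performs the one-hot encoding automatically. As the figure above shows, numeric columns are left unchanged, while the "类型" column is replaced by one new column per distinct value, filled with 1s and 0s. For example, the first row is an in-ear headset, so its "类型耳塞式" (type: in-ear) column is 1 and the other type columns are 0, and so on.

After one-hot encoding, the string data is represented numerically and the model can work with it.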
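
The original table is only available as an image, so the following is a minimal sketch with made-up rows; the column names `price` and `type` and all values are assumptions chosen for illustration. It shows how `pandas.get_dummies()` leaves the numeric column alone and expands the string column into 0/1 columns.

```python
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "price": [35.0, 58.0, 120.0],
    "type": ["in-ear", "over-ear", "in-ear"],
})

# Only the string column is expanded; the numeric column passes through.
# dtype=int makes the indicator columns 0/1 rather than booleans.
encoded = pd.get_dummies(df, columns=["type"], dtype=int)
print(encoded)
```

Each distinct value of `type` becomes its own column, named with the original column name as a prefix.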

2.DictVectorizer

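The DictVectorizer class lives at sklearn.feature_extraction.DictVectorizer. It converts feature arrays represented as lists of standard Python dict objects into the NumPy/SciPy representation used by scikit-learn estimators. Example: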

[Image: DictVectorizer example code and its output]

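As the example shows, DictVectorizer takes Python dict data and automatically one-hot encodes the string-valued features, while numeric values are kept as they are. The result is similar to what the pandas function get_dummies() produces.

Note that vec.fit_transform(measurements) returns a 3x4 sparse matrix, that is, a scipy.sparse matrix.

Question: why convert to a scipy.sparse matrix?
Suppose the example above contained several hundred cities. One-hot encoding would then create a very large number of columns, most of whose values are 0. To keep the resulting data structure small enough to fit in memory, DictVectorizer uses a scipy.sparse matrix by default rather than a numpy.ndarray.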

3.CountVectorizer

The CountVectorizer class lives at sklearn.feature_extraction.text.CountVectorizer. Let's start with the description in its source docstring.

Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix

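In other words, it converts a collection of text documents into a matrix of token counts: for every document in the collection it counts how often each word occurs.

The result is again a scipy.sparse matrix. Because these are word counts and the vocabulary of real text runs to tens of thousands of words, the sparse format avoids building a very wide, mostly zero table.

CountVectorizer is simple to use. Example: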

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]
vec = CountVectorizer()
ft = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
print(ft.toarray())

Output:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

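As the example shows, CountVectorizer by default lowercases the text and extracts tokens of two or more word characters, treating whitespace and punctuation as separators. It then counts the occurrences of each token in each document.

Question: English words are separated by spaces, so how should Chinese text be handled?
Use the jieba segmentation library to split the text into words, then join the words with spaces.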

import jieba
from sklearn.feature_extraction.text import CountVectorizer

sentences = []
# Original corpus
corpus = ["现在晴朗，放心出门",
          "天气很好，温度适宜",
          "马上下雨，带好雨伞"]

# Segment each sentence with jieba and join the words with spaces
for sentence in corpus:
    words = jieba.cut(sentence)
    sentences.append(" ".join(words))

vec = CountVectorizer()
ft = vec.fit_transform(sentences)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
print(ft.toarray())


Output:
['下雨', '出门', '天气', '放心', '晴朗', '温度', '现在', '适宜', '雨伞', '马上']

[[0 1 0 1 1 0 1 0 0 0]
 [0 0 1 0 0 1 0 1 0 0]
 [1 0 0 0 0 0 0 0 1 1]]

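The examples above use only the default parameters of CountVectorizer, but the following parameters deserve attention:

preprocessor: a function that preprocesses the text before tokenization. The default is None.

tokenizer: a function that splits a string into a sequence of tokens. The default is None. Only applies when analyzer == 'word'.

Question: in the example above, can the words be split on some other delimiter, such as /, ? or *, and can "下雨" (rain) be changed to "下雪" (snow)?
Yes. tokenizer specifies how the text is split, and preprocessor can preprocess the corpus, for example by deleting or replacing text.

With a slightly modified corpus, for example: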

from sklearn.feature_extraction.text import CountVectorizer

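# Original corpus
corpus = ["现在/晴朗/放心/出门",
          "天气/很好/温度/适宜",
          "马上/下雨/带好/雨伞"]

# Split words on "/"
def my_tokenizer(s):
    return s.split("/")

# Replace every occurrence of "下雨" (rain) with "下雪" (snow)
def my_preprocessor(s):
    return s.replace("下雨", "下雪")

# token_pattern=None avoids the warning that token_pattern is unused when a tokenizer is given
vec = CountVectorizer(tokenizer=my_tokenizer, preprocessor=my_preprocessor, token_pattern=None)
ft = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
print(ft.toarray())


Output: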
['下雪', '出门', '天气', '带好', '很好', '放心', '晴朗', '温度', '现在', '适宜', '雨伞', '马上']

[[0 1 0 0 0 1 1 0 1 0 0 0]
 [0 0 1 0 1 0 0 1 0 1 0 0]
 [1 0 0 1 0 0 0 0 0 0 1 1]]

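stop_words: specifies the stop words. If "english" is passed, the built-in English stop word list is used. If a list is passed, it is assumed to contain the stop words, all of which are removed from the resulting tokens. The default is None. Only applies when analyzer == 'word'.

Question: what are stop words for?
In document classification we usually count how often each word occurs in a document and compute word weights to predict the class. Some words, however, are neutral and say little about the type of document, for example "我, 你, 他, 今天, 明天, 后天, 现在, 马上" (I, you, he, today, tomorrow, the day after tomorrow, now, right away). Stop words tell the vectorizer to ignore such words, so that the important vocabulary stands out.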

from sklearn.feature_extraction.text import CountVectorizer

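# Original corpus
corpus = ["现在/晴朗/放心/出门",
          "天气/很好/温度/适宜",
          "马上/下雨/带好/雨伞"]

# Split words on "/"
def my_tokenizer(s):
    return s.split("/")

# Replace every occurrence of "下雨" (rain) with "下雪" (snow)
def my_preprocessor(s):
    return s.replace("下雨", "下雪")

# Stop word list: drop the two words "现在" (now) and "马上" (right away)
stop_words = ["现在", "马上"]

# token_pattern=None avoids the warning that token_pattern is unused when a tokenizer is given
vec = CountVectorizer(tokenizer=my_tokenizer, preprocessor=my_preprocessor,
                      stop_words=stop_words, token_pattern=None)
ft = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
print(ft.toarray())


Output: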
['下雪', '出门', '天气', '带好', '很好', '放心', '晴朗', '温度', '适宜', '雨伞']

[[0 1 0 0 0 1 1 0 0 0]
 [0 0 1 0 1 0 0 1 1 0]
 [1 0 0 1 0 0 0 0 0 1]]

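Stop word lists are not fixed; adapt them to your own application. Several commonly used Chinese stop word lists are available as a starting point.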

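4.TfidfVectorizer

In a large text corpus, some very frequent words carry almost no useful information about the content of a document. If the raw counts are fed directly to a classifier, these frequent words drown out the rarer words we actually care about. They can of course be added to the stop word list, but since the right list depends on the application, this cannot fully solve the problem. This is where the tf-idf transform comes in: it computes a weight for each term, that is, how important the term is. The tf-idf transform is defined as: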

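$tf-idf(t,d) = tf(t,d)\times idf(t)$

Here tf is the term frequency and idf is the inverse document frequency, computed as:

$idf(t) = \log\frac{n}{1+df(t)}$

where $n$ is the total number of documents in the collection and $df(t)$ is the number of documents that contain the term $t$. An example: a collection holds 100 documents, 10 of which contain the word "科技" (technology). A document whose type we want to predict has 500 words, and "科技" occurs 20 times in it. Using the natural logarithm, the tf-idf value of "科技" is:

$tf = \frac{20}{500} = \frac{1}{25}$

$idf = \log\frac{100}{10+1} \approx 2.2073$

so

$tf-idf(t,d) = \frac{1}{25} \times \log\frac{100}{10+1} \approx 0.0883$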
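
The arithmetic can be checked with a few lines of Python. This sketch plugs the numbers from the example (100 documents, 10 of them containing the term, 20 occurrences in a 500-word document) into the two formulas above. It assumes the natural logarithm, which is also what scikit-learn uses; the variable names are my own.

```python
import math

n_docs = 100       # documents in the collection
df_term = 10       # documents that contain the term
term_count = 20    # occurrences of the term in the document
doc_length = 500   # total words in the document

tf = term_count / doc_length               # 20 / 500 = 0.04
idf = math.log(n_docs / (1 + df_term))     # ln(100 / 11)
tfidf = tf * idf

print(round(tf, 4), round(idf, 4), round(tfidf, 4))
```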

The TfidfVectorizer class lives at sklearn.feature_extraction.text.TfidfVectorizer and implements the tf-idf transform. A simple example:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'first document.',
     'second document.',
     'second one.'
]
vec = TfidfVectorizer()
ft = vec.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
print(ft.toarray())

Output:
['document', 'first', 'one', 'second']

[[0.60534851 0.79596054 0.         0.        ]
 [0.70710678 0.         0.         0.70710678]
 [0.         0.         0.79596054 0.60534851]]

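TfidfVectorizer computes things slightly differently. Its tf is the raw number of times the word occurs in the given document (not divided by the document length), which is then multiplied by idf. The idf formula also differs a little (with the default smooth_idf=True):

$idf(t) = \log\frac{1+n}{1+df(t)}+1$

The resulting tf-idf vector is then normalized by its Euclidean (L2) norm:

$v_{norm} = \frac{v}{\left\| v \right\|_{2}} = \frac{v}{\sqrt{v_{1}^{2}+v_{2}^{2}+v_{3}^{2}+...+v_{n}^{2}}}$

Using these formulas, let's compute by hand the value of "document" in the first document of the example above. From the corpus we can see:

"document" occurs 1 time in the first document; there are 3 documents in total, and 2 of them contain "document".

"first" occurs 1 time in the first document; there are 3 documents in total, and 1 of them contains "first".

"one" occurs 0 times in the first document; there are 3 documents in total, and 1 of them contains "one".

"second" occurs 0 times in the first document; there are 3 documents in total, and 2 of them contain "second".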

Substituting into the formula gives:

tf-idf of "document":

$tf-idf(t,d) = 1 \times \left(\log\frac{1+3}{1+2}+1\right) \approx 1.2876820724517808$

tf-idf of "first":

$tf-idf(t,d) = 1 \times \left(\log\frac{1+3}{1+1}+1\right) \approx 1.6931471805599454$

tf-idf of "one":

$tf-idf(t,d) = 0 \times \left(\log\frac{1+3}{1+1}+1\right) = 0$

tf-idf of "second":

$tf-idf(t,d) = 0 \times \left(\log\frac{1+3}{1+2}+1\right) = 0$

Then apply the Euclidean (L2) normalization:

$\frac{1.2876820724517808}{\sqrt{1.2876820724517808^{2}+1.6931471805599454^{2}+0^{2}+0^{2}}}\approx 0.605$

The other values can be computed in the same way.
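
The hand calculation can be reproduced with NumPy and compared against TfidfVectorizer. In this sketch the term counts and document frequencies for the first document are typed in by hand, in the column order document, first, one, second; the variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first document.", "second document.", "second one."]
n = len(corpus)

# First document; columns: document, first, one, second
tf = np.array([1, 1, 0, 0])   # raw counts in the first document
df = np.array([2, 1, 1, 2])   # number of documents containing each term

idf = np.log((1 + n) / (1 + df)) + 1   # smoothed idf
raw = tf * idf                          # unnormalized tf-idf
manual = raw / np.linalg.norm(raw)      # L2 normalization

sk = TfidfVectorizer().fit_transform(corpus).toarray()[0]
print(manual)
print(np.allclose(manual, sk))
```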

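TfidfVectorizer parameters

Before discussing the parameters, a word on how TfidfVectorizer relates to CountVectorizer. scikit-learn also provides a class called TfidfTransformer. Put simply:

TfidfVectorizer = CountVectorizer + TfidfTransformer

In other words, the example above gives the same result when CountVectorizer and TfidfTransformer are used together.

Consequently, the CountVectorizer parameters described earlier also exist on TfidfVectorizer and are used in the same way.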
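
As a quick check of this equivalence, the sketch below chains CountVectorizer and TfidfTransformer with `make_pipeline` and compares the result with TfidfVectorizer on the same three-document corpus, using default parameters throughout.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

corpus = ["first document.", "second document.", "second one."]

# Two-step version: raw counts first, then tf-idf weighting.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
a = two_step.fit_transform(corpus).toarray()

# One-step version.
b = TfidfVectorizer().fit_transform(corpus).toarray()

same = np.allclose(a, b)
print(same)
```

The two-step form is handy when the raw counts are needed as well, or when the counts come from a different source such as HashingVectorizer.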

Other parameters:

norm: which norm to use, 'l1' or 'l2'. The default is 'l2'. None disables normalization, for example:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
     'first document.',
     'second document.',
     'second one.'
]
vec = TfidfVectorizer(norm=None)
ft = vec.fit_transform(corpus)
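# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vec.get_feature_names_out().tolist())
print(ft.toarray())

Output: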
['document', 'first', 'one', 'second']

[[1.28768207 1.69314718 0.         0.        ]
 [1.28768207 0.         0.         1.28768207]
 [0.         0.         1.69314718 1.28768207]]

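The first row matches the unnormalized values computed by hand above.

use_idf: whether to apply idf weighting. The default is True. If set to False, the term frequencies are not multiplied by idf; the result is just the tf values followed by L2 normalization.

smooth_idf: smooths the idf weights by adding 1 to the document frequencies. The default is True. If set to False, the idf formula becomes:

$idf(t) = \log\frac{n}{df(t)}+1$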

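5.HashingVectorizer

The HashingVectorizer class lives at sklearn.feature_extraction.text.HashingVectorizer. It extracts features using the "hashing trick". Because it does not store a vocabulary, its memory footprint is very low, which makes it suitable for large datasets. Example: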

from sklearn.feature_extraction.text import HashingVectorizer
corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]
vec = HashingVectorizer(n_features=2**5)
ft = vec.fit_transform(corpus)
print(ft)
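
Printing the sparse matrix directly is hard to read, so here is a small sketch that inspects the result instead. It relies on two properties of HashingVectorizer: the number of columns is fixed by `n_features` regardless of the corpus, and the vectorizer is stateless, so `transform` can be called without fitting first.

```python
from sklearn.feature_extraction.text import HashingVectorizer

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vec = HashingVectorizer(n_features=2**5)

# No fit step is needed: tokens are mapped to columns by a hash function.
ft = vec.transform(corpus)
shape = ft.shape

# Nothing is learned from the data, so there is no vocabulary to look up.
has_vocab = hasattr(vec, "vocabulary_")

print(shape, has_vocab)
```

The price of this is that column indices cannot be mapped back to words, and with a small `n_features` different words may collide in the same column.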

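When memory is tight, FeatureHasher offers the same hashing approach for non-text features; HashingVectorizer essentially combines FeatureHasher with the text preprocessing and tokenization of CountVectorizer.

HashingVectorizer can replace TfidfVectorizer for feature extraction, but it does not provide IDF weighting. To get it, combine HashingVectorizer with TfidfTransformer.

Here is an excerpt from an example on the official site: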

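from types import SimpleNamespace

from sklearn.feature_extraction.text import (
    HashingVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

# Stand-in for the command-line options parsed in the official example
# (these particular values are assumed for illustration)
opts = SimpleNamespace(use_hashing=True, use_idf=True, n_features=10000)

if opts.use_hashing:
    # Extract features with HashingVectorizer; opts.use_idf decides whether IDF weighting is applied
    if opts.use_idf:
        # Apply IDF weighting with TfidfTransformer
        # on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', alternate_sign=False,
                                   norm=None)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        # No IDF weighting
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       alternate_sign=False, norm='l2')
else:
    # Extract features with TfidfVectorizer; opts.use_idf decides whether IDF weighting is applied
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)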

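Note that make_pipeline chains the individual transformers together into a single pipeline.

That concludes this walkthrough of feature extraction in scikit-learn. The author's knowledge is limited, so corrections are welcome.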
