Python sentiment analysis example

Reposted

mob6454cc762e37 2024-01-04 06:57:03

Code link: the original code with a few small changes so that it runs under Python 3.6.

4 Sentiment Analysis (20')

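For each sentence in the Stanford Sentiment Treebank dataset, we use the average of the word vectors of all words in the sentence as its feature vector and predict the sentiment level from it.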

We will train a softmax classifier and use train/dev validation to improve how well it generalizes.

(a) Sentence feature representation: the average of the word vectors in the sentence

See q4_sentiment.py for details:

import numpy as np


def getSentenceFeatures(tokens, wordVectors, sentence):
    """
    Obtain the sentence feature for sentiment analysis by averaging its word vectors.
    :param tokens: a dictionary that maps words to their indices in the word vector list
    :param wordVectors: word vectors (each row) for all tokens
    :param sentence: a list of words in the sentence of interest
    :return: sentVector: feature vector for the sentence
    """
    # Accumulator with one entry per word-vector dimension.
    sentVector = np.zeros((wordVectors.shape[1],))

    ### YOUR CODE HERE
    # Sum the vector of every word, then divide by the sentence length.
    for s in sentence:
        sentVector += wordVectors[tokens[s], :]
    sentVector *= 1.0 / len(sentence)
    ### END YOUR CODE

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector
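As a quick check of the averaging step, the sketch below applies the same computation to a made-up three-word vocabulary. The vocabulary, the vectors, and the helper name `average_word_vectors` are illustrative assumptions and are not part of the assignment data.

```python
import numpy as np


def average_word_vectors(tokens, word_vectors, sentence):
    """Mean of the rows of word_vectors selected by the words in sentence."""
    indices = [tokens[word] for word in sentence]
    return word_vectors[indices].mean(axis=0)


# Hypothetical toy vocabulary: 3 words with 2-dimensional vectors.
tokens = {"good": 0, "bad": 1, "movie": 2}
word_vectors = np.array([[1.0, 0.0],
                         [-1.0, 0.0],
                         [0.0, 2.0]])

features = average_word_vectors(tokens, word_vectors, ["good", "movie"])
print(features)
```

Indexing with a list of row indices does the summation in one vectorized call; a word that appears twice in the sentence is counted twice, just as in the loop version.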

(b) Why regularize:

To avoid overfitting and improve generalization to unseen examples.
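To make the idea concrete, here is a minimal sketch of a regularized softmax loss. The post does not say which penalty is used, so the L2 penalty, the function name `softmax_loss_l2`, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np


def softmax_loss_l2(features, labels, weights, reg):
    """Mean cross-entropy of a linear softmax classifier plus an L2 penalty."""
    scores = features @ weights
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    n = features.shape[0]
    data_loss = -np.log(probs[np.arange(n), labels]).mean()
    penalty = 0.5 * reg * np.sum(weights ** 2)  # assumed L2 penalty
    return data_loss + penalty


# Hypothetical toy problem: 2 samples, 2 features, 3 classes.
toy_features = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_labels = np.array([0, 1])
toy_weights = np.ones((2, 3))

unregularized = softmax_loss_l2(toy_features, toy_labels, toy_weights, reg=0.0)
regularized = softmax_loss_l2(toy_features, toy_labels, toy_weights, reg=0.1)
print(unregularized, regularized)
```

The penalty grows with the size of the weights, so a larger `reg` pushes the optimizer toward smaller weights, which tends to reduce overfitting at the cost of a worse fit to the training set.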

Searching for the "best" regularization parameter:

import numpy as np


def getRegularizationValues():
    """Try different regularizations
    :return: sorted: a sorted list of values to try
    """
    values = None  # Assign a list of floats in the block below
    ### YOUR CODE HERE
    # 100 candidates spaced evenly on a log scale from 1e-4 to 1e2.
    values = np.logspace(-4, 2, num=100, base=10)
    ### END YOUR CODE
    return sorted(values)
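Once a classifier has been trained for each candidate, the "best" value is usually the one with the highest accuracy on the dev set. The sketch below shows only that selection step; the record layout (`reg`, `train`, `dev` keys), the function name `choose_best_reg`, and the accuracy numbers are placeholders invented for illustration, not results from the assignment.

```python
def choose_best_reg(results):
    """Return the record with the highest dev accuracy (first one wins ties)."""
    return max(results, key=lambda record: record["dev"])


# Placeholder records: the accuracies are made up to exercise the function.
results = [
    {"reg": 1e-4, "train": 0.90, "dev": 0.60},
    {"reg": 1e-2, "train": 0.80, "dev": 0.70},
    {"reg": 1e0, "train": 0.50, "dev": 0.45},
]

best = choose_best_reg(results)
print(best["reg"])
```

Selecting on dev accuracy rather than train accuracy is the point of the train/dev split: the weakest regularization often fits the training set best while generalizing worse.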

(1) The numpy.arange function creates a range of evenly spaced values and returns it as an ndarray:

numpy.arange(start, stop, step, dtype)

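Parameter	Description
`start`	Start value; defaults to `0`
`stop`	End value (not included)
`step`	Step size; defaults to `1`
`dtype`	Data type of the returned `ndarray`; if omitted, it is inferred from the other arguments.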
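A few calls illustrate the parameters above; the variable names are arbitrary.

```python
import numpy as np

only_stop = np.arange(5)            # start defaults to 0, step to 1
with_step = np.arange(2, 10, 3)     # 2, 5, 8 (stop=10 is excluded)
fractional = np.arange(0, 1, 0.25)  # a float step gives a float array

print(only_stop, with_step, fractional)
```

With a non-integer step, rounding can make the length of the result hard to predict, which is why `numpy.linspace` is usually preferred when the number of points matters.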

(2) The numpy.linspace function creates a one-dimensional array whose elements form an arithmetic sequence:

np.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

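Parameter	Description
`start`	First value of the sequence
`stop`	Last value of the sequence; included when `endpoint` is `True`
`num`	Number of evenly spaced samples to generate; defaults to `50`
`endpoint`	When `True`, the sequence includes `stop`; otherwise it does not. Defaults to `True`.
`retstep`	When `True`, the spacing between samples is returned together with the array; otherwise only the array is returned.
`dtype`	Data type of the `ndarray`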
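The effect of `endpoint` and `retstep` is easiest to see side by side; the variable names below are arbitrary.

```python
import numpy as np

closed = np.linspace(0.0, 1.0, num=5)                       # includes 1.0
half_open = np.linspace(0.0, 1.0, num=5, endpoint=False)    # stops before 1.0
samples, step = np.linspace(0.0, 1.0, num=5, retstep=True)  # also returns the spacing

print(closed, half_open, step)
```

Note that excluding the endpoint changes the spacing as well: the same five samples cover the interval in steps of 0.2 instead of 0.25.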

(3) The numpy.logspace function creates a geometric sequence:

np.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None)

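The base parameter is the base of the logarithm: the sequence consists of base raised to evenly spaced exponents between start and stop.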

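Parameter	Description
`start`	The sequence starts at base ** start
`stop`	The sequence ends at base ** stop; this value is included when `endpoint` is `True`
`num`	Number of samples to generate, evenly spaced on a log scale; defaults to `50`
`endpoint`	When `True`, the sequence includes the `stop` value; otherwise it does not. Defaults to `True`.
`base`	Base of the logarithm
`dtype`	Data type of the `ndarray`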
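The relationship to `numpy.linspace` can be checked directly: `logspace` raises `base` to evenly spaced exponents, so consecutive values share a constant ratio. The variable names below are arbitrary.

```python
import numpy as np

values = np.logspace(-4, 2, num=7)                  # 1e-4, 1e-3, ..., 1e2
same = 10.0 ** np.linspace(-4, 2, num=7)            # equivalent construction
ratios = values[1:] / values[:-1]                   # constant ratio of 10
powers_of_two = np.logspace(0, 3, num=4, base=2.0)  # 1, 2, 4, 8

print(values, ratios, powers_of_two)
```

This is why a log-spaced grid suits a regularization search: it covers several orders of magnitude with the same number of candidates in each decade.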