Modifying a corpus by inserting codewords using Python

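I have a corpus of about 30,000 customer reviews in a csv file (or a txt file). Each customer review is one line in the text file. Some examples are:

I want to change these texts to the following: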

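I have two separate lists (lexicons) of positive and negative words. For example, one text file contains positive words like these:

And another text file contains negative words like these: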

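So I want a Python script that reads the customer reviews: when any positive word is found, it inserts "POSITIVE" after the positive word; when any negative word is found, it inserts "NEGATIVE" after the negative word.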

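Here is the code I have tested so far. It works (see my comments in the code below), but it needs improvement to meet the needs described above.

Specifically, my_escaper works (it finds words such as cheap and good and replaces them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about a thousand positive/negative words. What I want is for the code to read those word lists from the lexicons, search for them in the corpus, and replace those words in the corpus (for example, from "good" to "good POSITIVE" and from "bad" to "bad NEGATIVE").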

#adapted from 

import re

def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# This my_escaper works (it finds words such as cheap and good and replaces
# them with cheap POSITIVE and good POSITIVE), but I have two files (lexicons),
# each containing about a thousand positive/negative words. What I want is for
# the code to read those word lists from the lexicons, search for them in the
# corpus, and replace them there (for example, "good" becomes "good POSITIVE"
# and "bad" becomes "bad NEGATIVE").

my_escaper = multiple_replacer(('cheap','cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

d = []
with open("review.txt","r") as file:
    for line in file:
        review = line.strip()
        d.append(review) 

for line in d:
    print(my_escaper(line))

If I understand correctly, you need something like this:

if word in POSITIVE_LIST:
    word = word + " POSITIVE"
elif word in NEGATIVE_LIST:
    word = word + " NEGATIVE"

Is that right?

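A straightforward way to code this is to load the positive and negative words from the lexicons into separate sets. Then, for each review, split the sentence into a list of words and look up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list, then join the list to build the final string.

Example: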

import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
    ]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

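for sentence in reviews:
    tagged = []
    # Split on runs of non-word characters; this discards punctuation.
    for word in re.split(r'\W+', sentence):
        tagged.append(word)
        # Lexicon entries are lowercase, so compare case-insensitively.
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))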
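
The sets above are hard-coded to keep the example short. To fill them from the lexicon files instead, something like the following might work. It is only a sketch for Python 3: the helper name load_lexicon is made up here, and it assumes each lexicon file holds one word per line in UTF-8, which the question does not actually confirm.

```python
def load_lexicon(path):
    """Return the set of lowercased, non-empty lines found in the file at `path`.

    Assumption: the lexicon file lists one word per line, encoded as UTF-8.
    """
    with open(path, encoding="utf-8") as handle:
        # Strip whitespace and skip blank lines so stray newlines do not
        # end up in the set as empty "words".
        return {line.strip().lower() for line in handle if line.strip()}

# Hypothetical usage; the file names are placeholders, not the asker's real files:
# positive_words = load_lexicon("positive.txt")
# negative_words = load_lexicon("negative.txt")
```

Building the sets once, before looping over the 30,000 reviews, keeps each per-word lookup cheap.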

While this approach is simple, it has one drawback: because it uses re.split(), you lose the punctuation.
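
One possible way around the lost punctuation is to leave the sentence intact and let re.sub() call a function for each word it finds, so that everything between the words is copied through unchanged. The sketch below is written for Python 3 and is not part of the original answer's code; the names tag_review, add_label, and WORD_PATTERN are invented for illustration, and the small word sets stand in for the real lexicons.

```python
import re

# Stand-ins for the real lexicons (assumed to contain lowercase words).
POSITIVE_WORDS = {"amazing", "great", "awesome", "reasonable"}
NEGATIVE_WORDS = {"poor", "bad", "rude"}

# Matches runs of word characters; punctuation and spaces are never matched,
# so re.sub() leaves them exactly where they were.
WORD_PATTERN = re.compile(r"\w+")

def tag_review(text, positive_words, negative_words):
    """Return `text` with POSITIVE/NEGATIVE inserted after each lexicon word."""
    def add_label(match):
        word = match.group(0)
        key = word.lower()  # case-insensitive lookup against the lexicon
        if key in positive_words:
            return word + " POSITIVE"
        if key in negative_words:
            return word + " NEGATIVE"
        return word  # not a sentiment word: keep it unchanged
    return WORD_PATTERN.sub(add_label, text)

print(tag_review("This bike is amazing, but the brake is very poor",
                 POSITIVE_WORDS, NEGATIVE_WORDS))
# This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
```

Note that \w+ treats an apostrophe as a word boundary, so a word such as "don't" is looked up as "don" and "t"; whether that matters depends on what the lexicons contain.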