What is an efficient way to remove stop words in TextBlob for sentiment analysis of text?
I'm trying to implement the Naive Bayes algorithm for sentiment analysis of newspaper headlines. For this I'm using TextBlob, and I'm finding it difficult to remove stop words such as 'a', 'the', 'in', etc. Here is a snippet of my code:
from textblob.classifiers import NaiveBayesClassifier
from textblob import TextBlob

test = [
    ("11 bonded labourers saved from shoe firm", "pos"),
    ("Scientists greet Abdul Kalam after the successful launch of Agni on May 22, 1989", "pos"),
    ("Heavy Winter Snow Storm Lashes Out In Northeast US", "neg"),
    ("Apparent Strike On Gaza Tunnels Kills 2 Palestinians", "neg")
]

with open('input.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")

print(cl.classify("Oil ends year with biggest gain since 2009"))  # "pos"
print(cl.classify("25 dead in Baghdad blasts"))  # "neg"
You can load the JSON first, then build the list of (text, label) tuples, stripping stop words with replace as you go.
Demo:
Suppose the input.json file looks like this:
[
{"text": "I love this sandwich.", "label": "pos"},
{"text": "This is an amazing place!", "label": "pos"},
{"text": "I do not like this restaurant", "label": "neg"}
]
Then you can use:
from textblob.classifiers import NaiveBayesClassifier
import json

train_list = []
with open('input.json', 'r') as fp:
    json_data = json.load(fp)
    for line in json_data:
        text = line['text']
        text = text.replace(" is ", " ")  # chain more replace calls to strip other stop words
        label = line['label']
        train_list.append((text, label))

cl = NaiveBayesClassifier(train_list)

from pprint import pprint
pprint(train_list)
Output:
[(u'I love this sandwich.', u'pos'),
(u'This an amazing place!', u'pos'),
(u'I do not like this restaurant', u'neg')]
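Note that `replace(" is ", " ")` only handles one stop word and misses words at the start or end of a sentence. A more general option is to filter each headline's tokens against a stop word set. This is a minimal sketch of that idea; the `stop_words` set and `strip_stops` helper here are a small hypothetical sample, not TextBlob API:

```python
# Hypothetical sample stop word set; extend it as needed.
stop_words = {"a", "an", "the", "is", "in"}

def strip_stops(text):
    # Keep only tokens that are not stop words, preserving order.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

data = [
    {"text": "I love this sandwich.", "label": "pos"},
    {"text": "This is an amazing place!", "label": "pos"},
]
train_list = [(strip_stops(d["text"]), d["label"]) for d in data]
print(train_list)
# [('I love this sandwich.', 'pos'), ('This amazing place!', 'pos')]
```

The filtered `train_list` can then be passed to `NaiveBayesClassifier` exactly as in the snippet above.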
Below is code that removes stop words from text.
Put all the stop words in a stopwords file, then read those words into the stop_words variable.
# This function reads a file and returns its lines as a list,
# optionally lower-casing each line.
def readFileandReturnAnArray(fileName, readMode, isLower):
    myArray = []
    with open(fileName, readMode) as readHandle:
        for line in readHandle:
            if isLower:
                line = line.lower()
            myArray.append(line.strip())
    # The with-block closes the file automatically.
    return myArray
stop_words = readFileandReturnAnArray("stopwords", "r", True)

def removeItemsInTweetContainedInAList(tweet_text, stop_words, splitBy):
    wordsArray = tweet_text.split(splitBy)
    # Stop words actually present in this text.
    stopsFound = set(wordsArray).intersection(stop_words)
    return_str = ""
    for word in wordsArray:
        if word not in stopsFound:
            return_str += word + splitBy
    return return_str.strip()

# Call the above method
tweet_text = removeItemsInTweetContainedInAList(tweet_text.strip(), stop_words, " ")
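For a quick self-contained check of the same set-intersection technique, a tiny inline stop word list can stand in for the stopwords file (the `remove_stops` name and sample words below are illustrative, not from the answer above):

```python
def remove_stops(text, stop_words, sep=" "):
    # Intersect the token set with the stop word set, then rebuild
    # the string from the tokens that are not stop words.
    words = text.split(sep)
    found = set(words) & set(stop_words)
    return sep.join(w for w in words if w not in found)

stop_words = ["a", "the", "in", "is"]  # sample list; normally read from the file
print(remove_stops("25 dead in baghdad blasts", stop_words))
# 25 dead baghdad blasts
```

Using a set for membership tests keeps the lookup O(1) per token, which matters once the stop word list grows to the usual few hundred entries.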