Update the set of stop words by adding the custom stop words

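We are given a default set of stop words. We need to add some extra custom words to it, remove all of those words from the given sentence, and get the sentence without stop words.

I tried this, but the output I get is NONE. Please help!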

sentence = 'Hello, good morning folks! Today we will announce the half yearly performance results of the company. Due to the ongoing COVID-19 pandemic, our profits have declined by 60% as compared to the last year'

stop_words = { "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"}

    custom_stop_words = ["hello","good","morning","half","year"]
    updated_stop_words= list(stop_words).append(custom_stop_words)

    print(updated_stop_words)


    **Output:
    NONE**

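The problem is that list.append modifies the list in place and returns None. You never assigned list(stop_words) to anything, so there is no variable holding the modified list afterwards, and updated_stop_words ends up being None.

One way to add the custom words to the existing stop words is to assign list(stop_words) to a variable and then call list.extend on it: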
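A minimal sketch of that behavior, using a shortened stand-in list (the names `words` and `result` are illustrative and do not come from the question):

```python
# Shortened, illustrative stand-in for the question's stop words.
words = ["i", "me", "my"]

# list.append mutates the list in place and returns None,
# so `result` does not hold the updated list.
result = words.append("hello")

print(result)  # None
print(words)   # ['i', 'me', 'my', 'hello']
```

The updated data lives in `words`; the return value of `append` carries nothing useful.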

updated_stop_words = list(stop_words)
updated_stop_words.extend(custom_stop_words)

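However, since you will later test for membership inside a comprehension, updated_stop_words is better off as a set. Because stop_words is already a set, you can use the union operator (|) to combine it with custom_stop_words once that list has been converted to a set.

Then, in a list comprehension, you can check whether each word is in updated_stop_words and exclude it if it is.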

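import string
# Union of the default stop words and the custom ones
updated_stop_words = stop_words | set(custom_stop_words)
# Compare each word lowercased and with trailing punctuation stripped
out = [w for w in sentence.split() if w.lower().rstrip(string.punctuation) not in updated_stop_words]

Output: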

['folks!', 'Today', 'announce', 'yearly', 'performance', 'results', 
 'company.', 'Due', 'ongoing', 'COVID-19', 'pandemic,', 'profits', 
 'declined', '60%', 'compared', 'last']
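If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name `remove_stop_words` and the shortened stop word set below are assumptions for illustration, not part of the question.

```python
import string

def remove_stop_words(sentence, stop_words, custom_stop_words=()):
    """Return the words of `sentence` that are not stop words (hypothetical helper)."""
    # Build a new set; neither input collection is modified.
    updated_stop_words = set(stop_words) | set(custom_stop_words)
    return [
        w for w in sentence.split()
        # Compare lowercased, with trailing punctuation stripped.
        if w.lower().rstrip(string.punctuation) not in updated_stop_words
    ]

# Shortened, illustrative inputs.
default_stop_words = {"we", "will", "the", "of"}
custom = ["hello", "good", "morning"]

result = remove_stop_words(
    "Hello, good morning folks! We will announce the results",
    default_stop_words,
    custom,
)
print(result)  # ['folks!', 'announce', 'results']
```

As in the snippet above, the kept words retain their original case and punctuation; only the comparison is normalized.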

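The append method actually returns None. It appends an object to the original list in place. So instead of assigning the result to another variable, just call append and then print the original list.

In addition, I suggest using extend rather than append, so that the elements are added to the stop words as individual strings instead of the whole list being appended as a single element.

Try the code this way: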
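The difference between the two methods can be seen in a short sketch with shortened, illustrative lists (the names `appended` and `extended` are made up for this example):

```python
custom_stop_words = ["hello", "good"]

appended = ["i", "me"]
appended.append(custom_stop_words)  # adds the whole list as ONE nested element
print(appended)  # ['i', 'me', ['hello', 'good']]

extended = ["i", "me"]
extended.extend(custom_stop_words)  # adds each string individually
print(extended)  # ['i', 'me', 'hello', 'good']

# Membership tests only behave as expected on the extended list.
print("hello" in appended)  # False
print("hello" in extended)  # True
```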

sentence = 'Hello, good morning folks! Today we will announce the half yearly performance results of the company. Due to the ongoing COVID-19 pandemic, our profits have declined by 60% as compared to the last year'

stop_words = [ "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

custom_stop_words = ["hello","good","morning","half","year"]
stop_words.extend(custom_stop_words)

print(stop_words)

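Note:

The word himself in your stop words collection is missing its opening double quote. You also need to remove the indentation in front of the later lines. Given that your code runs successfully, I assume these are just typos in the question rather than typos in the code you actually ran.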