Splitting to words in a list of strings

Question

I want to remove stop words.

I have a list of about 15,000 strings. The strings are short texts. My code is as follows:

h = []
for w in clean.split():
    if w not in cachedStopWords:
        h.append(w)
    if w in cachedStopWords:
        h.append(" ")
print(h)

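I know that .split() is necessary so that each whole string is not compared against the stop word list. But it doesn't seem to work, because it can't split a list. (Without any kind of split, h = clean, because there are obviously no matches.)

Does anyone know how else I could split the different strings in the list while still preserving their case?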

Answer 1

A very minimal example:

stops = {'remove', 'these', 'words'}

strings = ['please do not remove these words', 'removal is not cool', 'please please these are the bees\' knees', 'there are no stopwords here']

strings_cleaned = [' '.join(word for word in s.split() if word not in stops) for s in strings]

Or you can do:

strings_cleaned = []
for s in strings:
    word_list = []
    for word in s.split():
        if word not in stops:
            word_list.append(word)
    s_string = ' '.join(word_list)
    strings_cleaned.append(s_string)

This is uglier (I think) than the one-liner above, but perhaps more intuitive.
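Since the question also asks about preserving case, here is a minimal sketch of one way to do that, assuming matching should ignore case: lowercase each word only for the comparison and keep the original word in the output. The function name `remove_stops` is made up for illustration.

```python
def remove_stops(strings, stops):
    """Drop stop words from each string, comparing case-insensitively."""
    # Assumption: normalise the stop words to lowercase once, up front.
    stops_lower = {s.lower() for s in stops}
    return [
        # word.lower() is used only for the lookup; the original word is kept.
        ' '.join(word for word in s.split() if word.lower() not in stops_lower)
        for s in strings
    ]


strings = ['Please do not Remove These words', 'Removal is not cool']
print(remove_stops(strings, {'remove', 'these', 'words'}))
# ['Please do not', 'Removal is not cool']
```

Words that survive the filter keep their original capitalisation; only the membership test is case-insensitive.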

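Make sure you convert your stop word container to a set (a hash-based container whose lookups are O(1) on average) rather than a list, whose lookups are O(n).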
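For instance, if the stop words start out as a list (which `cachedStopWords` in the question may be; that is an assumption, since its definition isn't shown), converting it once before the loop should be enough. The name `cached_stop_words` below is a hypothetical stand-in.

```python
# Hypothetical stop word list standing in for the asker's cachedStopWords.
cached_stop_words = ['remove', 'these', 'words']

# Convert once, outside any loop, so every membership test hits the set.
stops = set(cached_stop_words)

strings = ['please do not remove these words', 'removal is not cool']
strings_cleaned = [' '.join(w for w in s.split() if w not in stops) for s in strings]
print(strings_cleaned)
# ['please do not', 'removal is not cool']
```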

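Edit: This is just a general, very simple example of how to remove stop words. Your use case might be slightly different, but since you haven't provided a sample of your data, we can't help any further.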

python

string

split

list

stop-words