How to remove some words from a list of text lists when both the text list and the remove-word list are huge
Using Python, I want to remove some words from texts stored as a list of lists, like the following (for example, text_list consists of 5 texts, each containing roughly 4 to 8 words, and the remove-word list contains 5 words):
text_list = [["hello", "how", "are", "you", "fine", "thank", "you"],
             ["good", "morning", "have", "great", "breakfast"],
             ["you", "are", "a", "student", "I", "am", "a", "teacher"],
             ["trump", "it", "is", "a", "fake", "news"],
             ["obama", "yes", "we", "can"]]
remove_words = ["hello", "breakfast", "a", "obama", "you"]
With small data like the above, this is a very simple problem, handled like so:
new_text_list = list()
for text in text_list:
    temp_list = list()
    for word in text:
        if word not in remove_words:
            temp_list.append(word)
    new_text_list.append(temp_list)
But when there are more than 10,000 texts, each with more than 1,000 words, and a remove-word list of more than 20,000 words, I wonder how you would handle such a situation. Is there any efficient Python code that produces the same result, or perhaps a multi-core approach, etc.? Thanks in advance!
Try sorting each sub-array alphabetically, then calling a binary search on each sub-array to find the corresponding elements you want to remove. It should speed up the process!
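One way to read this suggestion (sorting the sub-arrays themselves would destroy word order) is to sort the remove-word list once and binary-search it for every word. A minimal sketch of that interpretation using the standard-library bisect module; each lookup drops from O(n) to O(log n), although a set, as in the answer below, is simpler still:

from bisect import bisect_left

text_list = [["hello", "how", "are", "you", "fine", "thank", "you"],
             ["good", "morning", "have", "great", "breakfast"],
             ["you", "are", "a", "student", "I", "am", "a", "teacher"],
             ["trump", "it", "is", "a", "fake", "news"],
             ["obama", "yes", "we", "can"]]
remove_words = ["hello", "breakfast", "a", "obama", "you"]

# Sort the remove-word list once (O(n log n)), then binary-search it per word.
sorted_remove = sorted(remove_words)

def contains(sorted_words, target):
    # Binary search: True if target appears in the sorted list.
    i = bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

new_text_list = [[word for word in text if not contains(sorted_remove, word)]
                 for text in text_list]
print(new_text_list)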
Two basic techniques for speeding up the process are:
1) set objects have (mostly) constant-time membership tests, whereas a list object has to be traversed in full, so its membership test depends on the list's size (in other words, the test time grows in proportion to the length of the list)
2) avoid building intermediate collections where possible; use generators and comprehensions so they are evaluated lazily
Here is an example using both of these approaches:
#!/usr/bin/env python3
text_list = [["hello", "how", "are", "you", "fine", "thank", "you"],
             ["good", "morning", "have", "great", "breakfast"],
             ["you", "are", "a", "student", "I", "am", "a", "teacher"],
             ["trump", "it", "is", "a", "fake", "news"],
             ["obama", "yes", "we", "can"]]

# Use a set() for remove words because testing for inclusion is much faster than with a long list.
# Removed two of your original bad words so I could make sure some sentences passed.
remove_words = set(["hello", "breakfast", "obama"])

# Make a generator, rather than a list, so sentences are filtered lazily.
result = (sentence for sentence in text_list
          if all(word not in remove_words for word in sentence))
for acceptable in result:
    print(acceptable)
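Note that the generator above drops any whole sentence containing a bad word, rather than stripping the words out as the question asks. The same two techniques also handle word-level removal; a minimal sketch, reusing text_list from above:

remove_words = {"hello", "breakfast", "a", "obama", "you"}

# Lazily yield each sentence with the offending words stripped out.
filtered = ([word for word in sentence if word not in remove_words]
            for sentence in text_list)
for sentence in filtered:
    print(sentence)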