How to remove some words from a list of text lists when both the text list and the remove-word list are huge
Using Python, I want to remove some words from texts stored as a list of lists, like the following (for example, text_list consists of 5 texts, each containing roughly 4 to 8 words, and the remove-word list contains 5 words):
text_list = [["hello", "how", "are", "you", "fine", "thank", "you"],
             ["good", "morning", "have", "great", "breakfast"],
             ["you", "are", "a", "student", "I", "am", "a", "teacher"],
             ["trump", "it", "is", "a", "fake", "news"],
             ["obama", "yes", "we", "can"]]
remove_words = ["hello", "breakfast", "a", "obama", "you"]
With small data like the above, this is a very simple problem, handled like so:
new_text_list = list()
for text in text_list:
    temp_list = list()
    for word in text:
        if word not in remove_words:
            temp_list.append(word)
    new_text_list.append(temp_list)
But when there are more than 10,000 texts, each with more than 1,000 words, and a remove-word list of more than 20,000 words, I wonder how you would handle such a situation. Is there any efficient Python code that produces the same result, or perhaps a multi-core approach, etc.? Thanks in advance!
Try sorting each sub-array alphabetically, then calling a binary search on each sub-array to find the corresponding elements you want to remove. It should speed up the process!
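One way to read this suggestion (sorting the sub-arrays themselves would destroy word order) is to sort the remove-word list once and binary-search it for every word. A minimal sketch of that interpretation using the standard-library bisect module; each lookup drops from O(n) to O(log n), although a set, as in the answer below, is simpler still:

from bisect import bisect_left

text_list = [["hello", "how", "are", "you", "fine", "thank", "you"],
             ["good", "morning", "have", "great", "breakfast"],
             ["you", "are", "a", "student", "I", "am", "a", "teacher"],
             ["trump", "it", "is", "a", "fake", "news"],
             ["obama", "yes", "we", "can"]]
remove_words = ["hello", "breakfast", "a", "obama", "you"]

# Sort the remove-word list once (O(n log n)), then binary-search it per word.
sorted_remove = sorted(remove_words)

def contains(sorted_words, target):
    # Binary search: True if target appears in the sorted list.
    i = bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

new_text_list = [[word for word in text if not contains(sorted_remove, word)]
                 for text in text_list]
print(new_text_list)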
Two basic techniques for speeding up the process are:
1) set objects have (mostly) constant-time membership tests, whereas a list object has to be traversed in full, so its membership test depends on the list's size (in other words, the test time grows in proportion to the length of the list)
2) avoid building intermediate collections where possible; use generators and comprehensions so they are evaluated lazily
Here is an example using both of these approaches:
#!/usr/bin/env python3
text_list = [["hello", "how", "are", "you", "fine", "thank", "you"],
             ["good", "morning", "have", "great", "breakfast"],
             ["you", "are", "a", "student", "I", "am", "a", "teacher"],
             ["trump", "it", "is", "a", "fake", "news"],
             ["obama", "yes", "we", "can"]]

# Use a set() for remove words because testing for inclusion is much faster than with a long list.
# Removed two of your original bad words so I could make sure some sentences passed.
remove_words = set(["hello", "breakfast", "obama"])

# Make a generator, rather than a list, so sentences are filtered lazily.
result = (sentence for sentence in text_list
          if all(word not in remove_words for word in sentence))
for acceptable in result:
    print(acceptable)
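Note that the generator above drops any whole sentence containing a bad word, rather than stripping the words out as the question asks. The same two techniques also handle word-level removal; a minimal sketch, reusing text_list from above:

remove_words = {"hello", "breakfast", "a", "obama", "you"}

# Lazily yield each sentence with the offending words stripped out.
filtered = ([word for word in sentence if word not in remove_words]
            for sentence in text_list)
for sentence in filtered:
    print(sentence)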