Deleting words from text list

Question

I am trying to remove certain words from a list of text strings (in addition to using stop words), but for some reason it is not working.

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

exclude = ['am', 'there','here', 'for', 'of', 'user']

new_doc = [word for word in documents if word not in exclude]

print(new_doc)

Output

['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']

As you can see, none of the words in EXCLUDE were removed from DOCUMENTS ("for" is a typical example).

It does work with this statement:

new_doc = [word for word in str(documents).split() if word not in exclude]

But how do I get back the original elements of DOCUMENTS (albeit the "cleaned" ones)?

Thanks a lot for your help!

Answer 1

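You are iterating over the sentences, not the words. To filter words, split each sentence, use a nested loop to go through its words and filter them, then join the result back into a string.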

>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>> 
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
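To see why the original comprehension returned the list unchanged, it can help to look at what the loop variable actually holds. The following is a minimal sketch using two of the sentences from the question; the helper name `remove_words` is only illustrative and does not appear in the original post.

```python
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# Iterating over the list yields whole sentences, so each membership test
# compares a full sentence against the short words in exclude and never matches.
unchanged = [word for word in documents if word not in exclude]

def remove_words(sentences, words):
    """Return each sentence with the given whitespace-separated words removed.

    Hypothetical helper wrapping the nested comprehension from the answer.
    """
    words = set(words)  # a set gives fast membership tests
    return [' '.join(w for w in s.split() if w not in words) for s in sentences]

cleaned = remove_words(documents, exclude)
print(unchanged == documents)
print(cleaned)
```

Converting `exclude` to a set is optional for a list this small; it mainly pays off when the exclusion list grows.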

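Alternatively, instead of a nested list comprehension with splitting and filtering, you can use a regex with the re.sub function to replace the excluded words with an empty string: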

>>> import re
>>> 
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface  lab abc computer applications', 'A survey   opinion  computer system response time', 'The EPS  interface management system', 'System and human system engineering testing  EPS', 'Relation   perceived response time to error measurement', 'The generation  random binary unordered trees', 'The intersection graph  paths in trees', 'Graph minors IV Widths  trees and well quasi ordering', 'Graph minors A survey']
>>>

r'|'.join(exclude) joins the words with a pipe (which means logical OR in a regular expression).
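One caveat with the plain alternation is that it matches anywhere in the string, so a word such as "information" would lose its "for", and the removed words leave doubled spaces behind. A possible refinement, sketched here as an assumption rather than as part of the original answer, anchors the pattern at word boundaries and collapses the leftover whitespace. The names `pattern` and `strip_words` are illustrative.

```python
import re

documents = ["The EPS user interface management system",
             "Relation of user perceived response time to error measurement"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# \b anchors each alternative at word boundaries; re.escape guards against
# regex metacharacters appearing in the excluded words.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')

def strip_words(sentence):
    """Remove whole-word matches, then collapse the spaces they leave behind."""
    without = pattern.sub('', sentence)
    return ' '.join(without.split())

new_doc = [strip_words(sent) for sent in documents]
print(new_doc)
```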

Answer 2

You should split each line into words before filtering:

new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
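Both answers compare words exactly, so a capitalized "For" or "Of" would survive the filter. If case-insensitive matching is wanted (an assumption, since the question does not ask for it), one possible variation lower-cases each word only for the comparison and keeps the original spelling in the output. The function name and the second test sentence below are hypothetical.

```python
def remove_words_ignore_case(sentences, words):
    """Drop excluded words regardless of case; kept words retain their casing."""
    lowered = {w.lower() for w in words}
    return [' '.join(w for w in s.split() if w.lower() not in lowered)
            for s in sentences]

documents = ["System and human system engineering testing of EPS",
             "Of User interest"]  # second sentence is a made-up example
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

new_doc = remove_words_ignore_case(documents, exclude)
print(new_doc)
```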

python

text

stop-words