Trying to create a for loop that iterates through each item in a list, finds any matches in a separate list, and replaces each match

I have two lists. One contains banned words, like this:

bad_words = ["Boris", "Johnson", "coronavirus", "daily", "cases", "BBC"]

The other contains a news article, where each line of the article has been appended to the list, like this:

news_article = ['Boris Johnson outlined a three-tier system, based on the severity of coronavirus cases in each area.', 'The BBC will report more shortly.', 'And so on.', 'And so on.']

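I wrote a for loop that goes through each banned word and searches the news article for it. It then replaces every character of the word with an asterisk and appends the result to another list called text_bad_words_removed. See the code below: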

import re

text_bad_words_removed = []

for line in news_article:
    for word in bad_words:
        if word in line:
            asterisks_to_replace_word_with = '*'*len(word)
            newline_with_asterisks = re.sub(word, asterisks_to_replace_word_with, str(line))
            text_bad_words_removed.append(newline_with_asterisks)

print(text_bad_words_removed)

The result should look like this:

text_bad_words_removed = ['***** ******* outlined a three-tier system, based on the severity of *********** ***** in each area.', 'The *** will report more shortly.', 'And so on.', 'And so on.']

However, it looks like this:

text_bad_words_removed = ['***** Johnson outlined a three-tier system, based on the severity of coronavirus cases in each area.', 'Boris ******* outlined a three-tier system, based on the severity of coronavirus cases in each area.', 'Boris Johnson outlined a three-tier system, based on the severity of *********** cases in each area.', 'Boris Johnson outlined a three-tier system, based on the severity of coronavirus ***** in each area.', 'The *** will report more shortly.', 'And so on', 'And so on']

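The problem is that when a line contains more than one bad word, the whole line is copied into the list again for each additional bad word, as you can see.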

How can I fix this? Can I make the loop replace all the bad_words in a line first, and only then append that line, with every bad word replaced, to my new list?

You can compile a regular expression from the bad words up front and then use it in a list comprehension:

import re


bad_words = ["Boris", "Johnson", "coronavirus", "daily", "cases", "BBC"]
news_article =  ['Boris Johnson outlined a three-tier system, based on the severity of coronavirus cases in each area.', 'The BBC will report more shortly.', 'And so on.', 'And so on.']

to_replace = re.compile('|'.join(map(re.escape, bad_words)))
new_txt = [to_replace.sub(lambda g: '*' * len(g.group(0)), line) for line in news_article]

# pretty print to screen 
from pprint import pprint
pprint(new_txt)

Prints:

['***** ******* outlined a three-tier system, based on the severity of '
 '*********** ***** in each area.',
 'The *** will report more shortly.',
 'And so on.',
 'And so on.']
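
One caveat worth noting: an alternation built this way matches substrings as well, so "cases" would also be masked inside "showcases". If only whole words should be masked, a minimal sketch might wrap the alternation in `\b` word boundaries. The names `whole_word`, `mask` and `sample` are illustrative assumptions and not part of the original answer.

```python
import re

bad_words = ["Boris", "Johnson", "coronavirus", "daily", "cases", "BBC"]

# \b anchors the alternation at word boundaries, so only whole words match.
whole_word = re.compile(r'\b(?:' + '|'.join(map(re.escape, bad_words)) + r')\b')


def mask(match):
    # Replace the matched word with the same number of asterisks.
    return '*' * len(match.group(0))


# Hypothetical input chosen to show that "showcases" is left untouched.
sample = ['The BBC showcases daily cases.']
masked = [whole_word.sub(mask, line) for line in sample]
print(masked)
```

Whether substring or whole-word matching is wanted depends on the word list, so treat this as an option and not a required change.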
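
# Alternative: mask the words in place, one line at a time.
for i in range(len(news_article)):
    for word in bad_words:
        # str.replace leaves the line unchanged when the word is absent,
        # so no membership test is needed first.
        news_article[i] = news_article[i].replace(word, '*' * len(word))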
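
The loop above overwrites news_article in place. The question's code went wrong for two reasons: each substitution started again from the untouched line, and the append sat inside the inner loop. If the original lines should be kept and the result collected in text_bad_words_removed as the question describes, a sketch along these lines may help. The variable `masked_line` is an illustrative name, not from the question.

```python
bad_words = ["Boris", "Johnson", "coronavirus", "daily", "cases", "BBC"]
news_article = [
    'Boris Johnson outlined a three-tier system, based on the severity of coronavirus cases in each area.',
    'The BBC will report more shortly.',
    'And so on.',
    'And so on.',
]

text_bad_words_removed = []
for line in news_article:
    masked_line = line
    for word in bad_words:
        # Keep building on the already-masked line instead of the original.
        masked_line = masked_line.replace(word, '*' * len(word))
    # Append once per line, after every bad word has been handled.
    text_bad_words_removed.append(masked_line)

print(text_bad_words_removed)
```

Because the append happens once per line, lines without any bad word are carried over unchanged and the output has the same length as the input.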

You can do all of this with a single regular expression and one loop.
Like this.

>>> import re
>>> news_articles =  ['Boris Johnson outlined a three-tier system, based on the severity of coronavirus cases in each area.', 'The BBC will report more shortly.', 'And so on.', 'And so on.']
>>>
>>> bad_words = ["Boris", "Johnson", "coronavirus", "daily", "cases", "BBC"]
>>> rx = '(?i)(?:{0})'.format('|'.join(map(re.escape, bad_words)))
>>>
>>> for article in news_articles:
...     articleNew = re.sub(rx, lambda x: '*'*len(x.group()), article)
...     print( articleNew )
...
***** ******* outlined a three-tier system, based on the severity of *********** ***** in each area.
The *** will report more shortly.
And so on.
And so on.
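
Note that the `(?i)` prefix makes this pattern case-insensitive, which the other approaches here are not. The sketch below compares the two behaviours on a made-up line, passing `re.IGNORECASE` as the equivalent of the inline flag. The sample sentence and variable names are assumptions for illustration only.

```python
import re

bad_words = ["Boris", "BBC"]
pattern = '|'.join(map(re.escape, bad_words))

# Hypothetical line mixing lower-case and upper-case spellings.
line = 'boris spoke to the bbc and the BBC.'


def mask(match):
    # Replace the matched text with the same number of asterisks.
    return '*' * len(match.group())


# Default matching is case-sensitive: only the exact spelling is masked.
case_sensitive = re.sub(pattern, mask, line)

# re.IGNORECASE has the same effect as the inline (?i) flag.
case_insensitive = re.sub(pattern, mask, line, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

Case-insensitive matching catches more variants but can also mask ordinary words that happen to share a spelling with a banned name, so which one fits depends on the text.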