How to delete certain words from a variable or a list in Python

Question

import re

common_words = set(['je', 'tek', 'u', 'još', 'a', 'i', 'bi',
            's', 'sa', 'za', 'o', 'kojeg', 'koju', 'kojom', 'kojoj',
            'kojega', 'kojemu', 'će', 'što', 'li', 'da', 'od', 'do',
            'su', 'ali', 'nego', 'već', 'no', 'pri', 'se', 'li',
            'ili', 'ako', 'iako', 'bismo', 'koji', 'što', 'da', 'nije',
            'te', 'ovo', 'samo', 'ga', 'kako', 'će', 'dobro',
            'to', 'sam', 'sve', 'smo', 'kao'])
all = []


for (item_content, item_title, item_url, fetch_date) in cursor:
    # text = "{}".format(item_content)
    text = item_content
    # strip punctuation (raw string, so the backslash reaches the regex engine)
    text = re.sub(r'[,.?";:\-!@#$%^&*()]', '', text)
    text = text.lower()
    # text = [w for w in text if not w in common_words]
    all.append(text)

I want to remove certain words (stopwords) from the variable "text", or later from the list "all", into which I put every "text" value from the loop.

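I tried it as shown in the commented-out line, but that removes not only the words but also those letters wherever they occur inside other words, and the output comes out one character at a time, like 'd', 'f'. I want the format to stay the same; I only need to remove the words in the common_words list from the variable (or the list). How can I achieve this?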

Answer 1

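A pythonic way to remove punctuation from the text is the str.translate method. On Python 3, the deletion table is built with str.maketrans:

>>> from string import punctuation
>>> "this is224$# a ths".translate(str.maketrans('', '', punctuation))
'this is224 a ths'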

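To remove the words, use re.sub. First build a regular expression by joining the words with a pipe (|), wrapping each one in \b word boundaries so that only whole words match:

reg = '|'.join(r'\b{}\b'.format(re.escape(w)) for w in common_words)
new_text = re.sub(reg, '', text)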
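To apply such a pattern to every string in a list and tidy up the gaps it leaves, something like the following may work. This is a minimal sketch assuming Python 3; the function name remove_common_words, the shortened word set, and the sample strings are illustrative and not taken from the question.

```python
import re

# Shortened sample set for illustration; the question's set is longer.
common_words = {'je', 'a', 'i', 'što', 'da'}

# re.escape guards against words that contain regex metacharacters;
# \b restricts each alternative to whole-word matches.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, sorted(common_words))) + r')\b'
)

def remove_common_words(text):
    """Delete whole-word matches, then collapse leftover runs of spaces."""
    stripped = pattern.sub('', text)
    return re.sub(r' {2,}', ' ', stripped).strip()

# Stand-in for the list the question calls "all".
texts = ['ovo je dobar dan', 'ideja i plan', 'što da radim']
cleaned = [remove_common_words(t) for t in texts]
print(cleaned)  # ['ovo dobar dan', 'ideja plan', 'radim']
```

Compiling the pattern once outside the loop avoids rebuilding it for every row. In Python 3, \b is Unicode-aware for str patterns, so words such as 'što' are matched as whole words too.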

Example:

>>> import re
>>> from string import punctuation
>>> s = "this is224$# a ths"
>>> w = ['this', 'a']
>>> boundary_words = [r'\b{}\b'.format(i) for i in w]
>>> reg = '|'.join(boundary_words)
>>> new_text = re.sub(reg, '', s).translate(str.maketrans('', '', punctuation))
>>> new_text
' is224  ths'
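As an alternative to a regular expression, the list comprehension that is commented out in the question can be made to work on whole words by splitting the string first. This is a rough sketch, assuming punctuation has already been stripped and words are separated by whitespace; the shortened word set and the sample text are illustrative only.

```python
# Shortened sample set for illustration; the question's set is longer.
common_words = {'je', 'a', 'i', 'da'}

text = 'ovo je dan a ideja i plan'

# Iterating over a string yields single characters, which is why the
# question's attempt produced output like 'd', 'f'.
chars = [w for w in 'da']

# Splitting first yields whole words that can be tested against the set.
kept = [w for w in text.split() if w not in common_words]
result = ' '.join(kept)
print(result)  # ovo dan ideja plan
```

Note that joining with a single space normalizes the original whitespace, so line breaks and repeated spaces in the source text are not preserved.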

python

text

replace

stop-words