Replace words other than nouns and adjectives with a special string in Python

I want to replace words (e.g. verbs, adverbs, ...) with a special string (e.g. "NIL"), except for adjectives and nouns.

That is, given a piece of text:

anarchism originated as a term of abuse first used against early working class radicals

I first run part-of-speech tagging (universal tagset), which gives the tagged form:

anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP abuse/NOUN first/ADV used/VERB against/ADP early/ADJ working/NOUN class/NOUN radicals/NOUN

I want to get text like this:

anarchism/NOUN NIL NIL NIL term/NOUN NIL abuse/NOUN NIL NIL NIL early/ADJ working/NOUN class/NOUN radicals/NOUN

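That is, keep the nouns and adjectives and replace every other word with a special string such as "NIL".

Is there an efficient way to do this in Python? My corpus may be 10 GB or larger.

Thanks a lot!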

Try splitting the string into words and checking the tag on each one:

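# Keep nouns and adjectives, as the question asks; everything else becomes 'NIL'.
KEEP = ('/NOUN', '/ADJ')

tagged = 'anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP abuse/NOUN first/ADV used/VERB against/ADP early/ADJ working/NOUN class/NOUN radicals/NOUN'

# str.endswith accepts a tuple, so one call tests both suffixes.
# Joining a list also avoids the trailing space left by repeated concatenation.
result = ' '.join(tok if tok.endswith(KEEP) else 'NIL' for tok in tagged.split(' '))
print(result)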
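
Since the corpus may be 10 GB or more, it is probably better not to load it into memory in one piece. The following is a minimal sketch of a line-by-line version of the same split-and-check idea. It assumes the tagged corpus is plain text with whitespace-separated word/TAG tokens and regular line breaks; the names `KEEP_TAGS`, `mask_line`, and `mask_stream` are made up for this illustration.

```python
import io

KEEP_TAGS = frozenset({"NOUN", "ADJ"})  # assumption: universal tagset names
PLACEHOLDER = "NIL"


def mask_line(line, keep=KEEP_TAGS, placeholder=PLACEHOLDER):
    """Replace every word/TAG token whose tag is not in `keep`."""
    out = []
    for token in line.split():
        # rpartition splits on the last "/", so a word that itself
        # contains "/" still yields the correct tag.
        _, sep, tag = token.rpartition("/")
        out.append(token if sep and tag in keep else placeholder)
    return " ".join(out)


def mask_stream(src, dst):
    """Read `src` one line at a time and write the masked line to `dst`."""
    for line in src:
        dst.write(mask_line(line) + "\n")


# Small in-memory demonstration; real use would pass open file objects.
demo_in = io.StringIO("first/ADV used/VERB against/ADP early/ADJ working/NOUN\n")
demo_out = io.StringIO()
mask_stream(demo_in, demo_out)
print(demo_out.getvalue(), end="")  # NIL NIL NIL early/ADJ working/NOUN
```

Because iterating over a file object yields one line at a time, memory use depends on the longest line rather than on the size of the corpus. For a real corpus, `src` and `dst` would be files opened with `open(...)`; if the corpus has no line breaks at all, it would need to be read in fixed-size chunks instead.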

You can also use a regular expression, \w*/(?!NOUN|ADJ)[A-Z]*, where the negative lookahead skips any token tagged NOUN or ADJ:

>>> import re
>>> s = "anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP abuse/NOUN first/ADV used/VERB against/ADP early/ADJ working/NOUN class/NOUN radicals/NOUN"
>>> re.sub(r"\w*/(?!NOUN|ADJ)[A-Z]*", "NIL", s)
'anarchism/NOUN NIL NIL NIL term/NOUN NIL abuse/NOUN NIL NIL NIL early/ADJ working/NOUN class/NOUN radicals/NOUN'

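
The pattern above assumes that tags consist only of uppercase letters and that words contain only \w characters. As far as I know, the universal tagset tags punctuation as ".", so a token such as ./. would not be handled cleanly. Below is a sketch of a variant that matches any word/TAG token and decides in a replacement function; `KEEP_TAGS`, `TOKEN`, and `mask` are hypothetical names chosen for this example.

```python
import re

KEEP_TAGS = {"NOUN", "ADJ"}  # assumption: tags to leave untouched

# \S+ is greedy, so the "/" that ends up matched is the last one in the token;
# group 1 captures the tag that follows it.
TOKEN = re.compile(r"\S+/([^\s/]+)")


def _replace(match):
    # Return the token unchanged if its tag is kept, else the placeholder.
    return match.group(0) if match.group(1) in KEEP_TAGS else "NIL"


def mask(text):
    """Replace every word/TAG token whose tag is not in KEEP_TAGS."""
    return TOKEN.sub(_replace, text)


print(mask("first/ADV used/VERB early/ADJ radicals/NOUN ./."))
# NIL NIL early/ADJ radicals/NOUN NIL
```

Compiling the pattern once and reusing it avoids repeated pattern lookups when `mask` is called for every line of a large file, and changing which tags are kept only requires editing the set.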