Replace all words except nouns and adjectives with a special string in Python
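I want to replace words (for example verbs, adverbs, ...) with a special string (for example "NIL"), except for adjectives and nouns.
That is, for a piece of text: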
anarchism originated as a term of abuse first used against early working class radicals
I first run part-of-speech tagging (universal tagset) and get the tagged form:
anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP abuse/NOUN first/ADV used/VERB against/ADP early/ADJ working/NOUN class/NOUN radicals/NOUN
I want to get text like this:
anarchism/NOUN NIL NIL NIL term/NOUN NIL abuse/NOUN NIL NIL NIL NIL working/NOUN class/NOUN radicals/NOUN
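That is, keep the nouns and adjectives and replace every other word with a special string (such as "NIL").
Is there an efficient way to do this in Python? My corpus may be 10 GB or more.
Thanks a lot!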
Try splitting the string into words and checking the tag on each one:
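string = 'anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP abuse/NOUN first/ADV used/VERB against/ADP early/ADJ working/NOUN class/NOUN radicals/NOUN'
tokens = string.split(' ')
kept = []
for token in tokens:
    # Keep nouns as they are; replace every other token with NIL.
    # Only the NOUN tag is checked here, which matches the sample output above.
    if token.endswith('/NOUN'):
        kept.append(token)
    else:
        kept.append('NIL')
print(' '.join(kept))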
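The loop above keeps only nouns, which matches the sample output in the question (there, early/ADJ becomes NIL). If adjectives should be kept as well, as the question's wording asks, the same split-and-check idea could be written against a set of tags. This is only a sketch: the names `KEEP_TAGS` and `mask_line` are made up for illustration, and it assumes every token has the form word/TAG with tags from the universal tagset.

```python
# Tags to keep; assumed to be the universal-tagset names used in the question.
KEEP_TAGS = frozenset({"NOUN", "ADJ"})


def mask_line(line, keep=KEEP_TAGS, filler="NIL"):
    """Replace every word/TAG token whose TAG is not in `keep` with `filler`."""
    out = []
    for token in line.split():
        # rpartition splits on the last '/', so a word that itself
        # contains '/' still yields the right tag.
        _, sep, tag = token.rpartition("/")
        out.append(token if sep and tag in keep else filler)
    return " ".join(out)


tagged = ("anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP "
          "abuse/NOUN first/ADV used/VERB against/ADP early/ADJ "
          "working/NOUN class/NOUN radicals/NOUN")
print(mask_line(tagged))
# anarchism/NOUN NIL NIL NIL term/NOUN NIL abuse/NOUN NIL NIL NIL early/ADJ working/NOUN class/NOUN radicals/NOUN
```

A set lookup keeps the per-token cost constant, and changing which parts of speech survive only means editing `KEEP_TAGS`.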
You can also use this regular expression: \w*/(?!NOUN)[A-Z]*
>>> import re
>>> s = "anarchism/NOUN originated/VERB as/ADP a/DET term/NOUN of/ADP abuse/NOUN first/ADV used/VERB against/ADP early/ADJ working/NOUN class/NOUN radicals/NOUN"
>>> re.sub(r"\w*/(?!NOUN)[A-Z]*", "NIL", s)
'anarchism/NOUN NIL NIL NIL term/NOUN NIL abuse/NOUN NIL NIL NIL NIL working/NOUN class/NOUN radicals/NOUN'
You can test it here.
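For a corpus of 10 GB or more, one option is to compile the pattern once and stream the file line by line, so that only one line is held in memory at a time. The sketch below also extends the negative lookahead to spare ADJ as well as NOUN. The names `NON_CONTENT` and `mask_file` and the file names in the usage comment are hypothetical, and the sketch assumes the tagged corpus contains line breaks and that tags consist of uppercase letters; a corpus stored as one huge line would need to be read in chunks instead.

```python
import re

# Matches word/TAG tokens whose TAG is neither NOUN nor ADJ.
# The \b keeps a longer tag that merely starts with NOUN or ADJ
# from being treated as one of the kept tags.
NON_CONTENT = re.compile(r"\w*/(?!(?:NOUN|ADJ)\b)[A-Z]*")


def mask_file(src_path, dst_path, filler="NIL"):
    """Copy src_path to dst_path, replacing non-noun/adjective tokens."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # Iterating over the file object reads one line at a time.
        for line in src:
            dst.write(NON_CONTENT.sub(filler, line))


# Usage (file names are placeholders):
# mask_file("corpus.tagged.txt", "corpus.masked.txt")
```

Whether the regex or a plain split is faster on a given corpus is worth measuring on a sample before processing the full file.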