该名词和动词的位置
Position of that Noun and Verb
我有一个基于规则的代码,可以打印出句子中的名词后跟动词
for text_id, text in enumerate(news_df['news_title'].values):
# Remove the comma and full stops
text = text.replace(',', '').replace('.', '').replace('-','')
sentence_tags = POSTAG(text.lower())
print(text)
# Sentences parts
for index, part in enumerate(sentence_tags):
try:
if 'NN' in part[1] and 'VB' in sentence_tags[index + 1][1]:
print(">", part[0])
break
elif 'NN' in part[1] and 'NN' in sentence_tags[index + 1][1] and 'VB' in sentence_tags[index + 2][1]:
print(">", part[0], sentence_tags[index + 1][0])
break
elif 'NN' in part[1] and 'NN' in sentence_tags[index + 1][1] and 'NN' in sentence_tags[index + 2][1] and 'VB' in sentence_tags[index + 3][1]:
print(">", part[0], sentence_tags[index + 1][0], sentence_tags[index + 2][0])
break
except:
pass
print()
输出遵循这条规则的句子:
高中橄榄球运动员在视频表面出现欺凌行为后被起诉
> school football players
特朗普原告推动纽约通过成年幸存者法案计划起诉
>trump accuser
有没有办法同时打印出根据规则打印的那个名词的位置?
例如:
>trump accuser , [0,5,"NN"] , [6,13,"VB"]
我更改了脚本并分隔了 state machine
段。这个程序 IMO 最严重的问题是它只是返回第一个模式(你可以很快修复它)。
import pandas as pd
import nltk
POSTAG = nltk.pos_tag
df = pd.DataFrame({'text':['high school football players charged after video surfaces showing hazing', 'trump accuser pushes new york to pass the adult survivors act plans to sue']})
for text_id, text in enumerate(df['text'].values):
# Remove the comma and full stops
text = text.replace(',', '').replace('.', '').replace('-','')
tokens = nltk.word_tokenize(text.lower())
sentence_tags = POSTAG(tokens)
words = [item[0] for item in sentence_tags]
start_end = []
temp = 0
for word in words:
start_end.append([temp, temp+len(word)])
temp+= (len(word)+1)
tags = [item[1] for item in sentence_tags]
words_to_print = []
tags_to_print = []
start_end_to_print = []
# the state machine
verb = False
first_noun = False
second_noun = False
third_noun = False
for w, t, se in zip(words, tags, start_end):
if t.startswith('NN'):
words_to_print.append(w)
tags_to_print.append(t)
start_end_to_print.append(se)
first_noun = True
elif t.startswith('NN') and first_noun:
words_to_print.append(w)
tags_to_print.append(t)
start_end_to_print.append(se)
second_noun = True
elif t.startswith('NN') and second_noun:
words_to_print.append(w)
tags_to_print.append(t)
start_end_to_print.append(se)
third_noun = True
elif t.startswith('VB') and (first_noun or second_noun or third_noun):
break
elif (first_noun or second_noun or third_noun):
words_to_print = []
tags_to_print = []
start_end_to_print = []
verb = False
first_noun, second_noun, third_noun = False, False, False
print('> ', ' '.join(words_to_print), ' '.join([str(item[0])+' '+str(item[1]) for item in zip(start_end_to_print, tags_to_print)]))
输出:
> school football players [5, 11] NN [12, 20] NN [21, 28] NNS
> trump accuser [0, 5] NN [6, 13] NN
我有一个基于规则的代码,可以打印出句子中的名词后跟动词
for text_id, text in enumerate(news_df['news_title'].values):
# Remove the comma and full stops
text = text.replace(',', '').replace('.', '').replace('-','')
sentence_tags = POSTAG(text.lower())
print(text)
# Sentences parts
for index, part in enumerate(sentence_tags):
try:
if 'NN' in part[1] and 'VB' in sentence_tags[index + 1][1]:
print(">", part[0])
break
elif 'NN' in part[1] and 'NN' in sentence_tags[index + 1][1] and 'VB' in sentence_tags[index + 2][1]:
print(">", part[0], sentence_tags[index + 1][0])
break
elif 'NN' in part[1] and 'NN' in sentence_tags[index + 1][1] and 'NN' in sentence_tags[index + 2][1] and 'VB' in sentence_tags[index + 3][1]:
print(">", part[0], sentence_tags[index + 1][0], sentence_tags[index + 2][0])
break
except:
pass
print()
输出遵循这条规则的句子:
高中橄榄球运动员在视频表面出现欺凌行为后被起诉
> school football players
特朗普原告推动纽约通过成年幸存者法案计划起诉
>trump accuser
有没有办法同时打印出根据规则打印的那个名词的位置? 例如:
>trump accuser , [0,5,"NN"] , [6,13,"VB"]
我更改了脚本并分隔了 state machine
段。这个程序 IMO 最严重的问题是它只是返回第一个模式(你可以很快修复它)。
import pandas as pd
import nltk
POSTAG = nltk.pos_tag
df = pd.DataFrame({'text':['high school football players charged after video surfaces showing hazing', 'trump accuser pushes new york to pass the adult survivors act plans to sue']})
for text_id, text in enumerate(df['text'].values):
# Remove the comma and full stops
text = text.replace(',', '').replace('.', '').replace('-','')
tokens = nltk.word_tokenize(text.lower())
sentence_tags = POSTAG(tokens)
words = [item[0] for item in sentence_tags]
start_end = []
temp = 0
for word in words:
start_end.append([temp, temp+len(word)])
temp+= (len(word)+1)
tags = [item[1] for item in sentence_tags]
words_to_print = []
tags_to_print = []
start_end_to_print = []
# the state machine
verb = False
first_noun = False
second_noun = False
third_noun = False
for w, t, se in zip(words, tags, start_end):
if t.startswith('NN'):
words_to_print.append(w)
tags_to_print.append(t)
start_end_to_print.append(se)
first_noun = True
elif t.startswith('NN') and first_noun:
words_to_print.append(w)
tags_to_print.append(t)
start_end_to_print.append(se)
second_noun = True
elif t.startswith('NN') and second_noun:
words_to_print.append(w)
tags_to_print.append(t)
start_end_to_print.append(se)
third_noun = True
elif t.startswith('VB') and (first_noun or second_noun or third_noun):
break
elif (first_noun or second_noun or third_noun):
words_to_print = []
tags_to_print = []
start_end_to_print = []
verb = False
first_noun, second_noun, third_noun = False, False, False
print('> ', ' '.join(words_to_print), ' '.join([str(item[0])+' '+str(item[1]) for item in zip(start_end_to_print, tags_to_print)]))
输出:
> school football players [5, 11] NN [12, 20] NN [21, 28] NNS
> trump accuser [0, 5] NN [6, 13] NN