How to extract the verbs and all corresponding adverbs from a text?
Using ngrams in Python, my goal is to find verbs and their corresponding adverbs in an input text.
What I have done:
Input text: "He is talking weirdly. A horse can run fast. A big tree is there. The sun is beautiful. The place is well decorated. They are talking weirdly. She runs fast. She is talking greatly. Jack runs slow."
Code:
finder2 = BigramCollocationFinder.from_words(wrd for (wrd, tags) in posTagged if tags in ('VBG', 'RB', 'VBN'))
scored = finder2.score_ngrams(bigram_measures.raw_freq)
print(sorted(finder2.nbest(bigram_measures.raw_freq, 5)))
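Here posTagged is the POS-tagged token list and bigram_measures the NLTK bigram association measures; a minimal sketch of that setup (the exact code may differ slightly) would be:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

text = ("He is talking weirdly. A horse can run fast. A big tree is there. "
        "The sun is beautiful. The place is well decorated. They are talking weirdly. "
        "She runs fast. She is talking greatly. Jack runs slow.")

# posTagged is a list of (word, tag) pairs from NLTK's default tagger
posTagged = nltk.pos_tag(nltk.word_tokenize(text))
bigram_measures = BigramAssocMeasures()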
From my code I get the output:
[('talking', 'greatly'), ('talking', 'weirdly'), ('weirdly', 'talking'), ('runs', 'fast'), ('runs', 'slow')]
which is a list of verbs and their corresponding adverbs.
What I am looking for:
I want to get each verb together with all of its corresponding adverbs, e.g. ('talking' - 'greatly', 'weirdly'), ('runs' - 'fast', 'slow'), etc.
I think you are losing the information you need for this. You need to retain the part-of-speech data somehow, so that bigrams like ('weirdly', 'talking') can be processed the right way.
It may be that the bigram finder can accept tuples of tagged words (I'm not familiar with nltk). Alternatively, you may have to resort to building an external index. If so, something like this might work:
part_of_speech = {word: tag for word, tag in posTagged}
best_bigrams = finder2.nbest(... as you like it ...)
# put the verb first, flipping any (adverb, verb) bigrams
verb_first_bigrams = [b if part_of_speech[b[1]] == 'RB' else (b[1], b[0]) for b in best_bigrams]
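For the bigrams from the question this straightens out the one reversed pair, roughly like so (a sketch, assuming part_of_speech tags 'talking' as 'VBG' and the adverbs as 'RB'):
best_bigrams = [('talking', 'greatly'), ('talking', 'weirdly'), ('weirdly', 'talking'), ('runs', 'fast'), ('runs', 'slow')]
# ('weirdly', 'talking') fails the RB test on its second element, so it gets swapped:
# [('talking', 'greatly'), ('talking', 'weirdly'), ('talking', 'weirdly'), ('runs', 'fast'), ('runs', 'slow')]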
Then, with the verbs in front, you can turn it into a dict, a list of lists, or whatever you like:
adverbs_for = {}
for verb, adverb in verb_first_bigrams:
    if verb not in adverbs_for:
        adverbs_for[verb] = [adverb]
    else:
        adverbs_for[verb].append(adverb)
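With the five bigrams above this ends up as something like the following (a sketch; note the duplicated 'weirdly', since a plain list keeps every occurrence):
print(adverbs_for)
# {'talking': ['greatly', 'weirdly', 'weirdly'], 'runs': ['fast', 'slow']}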
You already have the list of all verb-adverb bigrams, so you are really just asking how to consolidate them into a dictionary that gives all the adverbs for each verb. But first, let's recreate your bigrams in a more straightforward way:
pairs = list()
for (w1, tag1), (w2, tag2) in nltk.bigrams(posTagged):
    # keep only verb-adverb bigrams, e.g. ('talking', 'weirdly')
    if tag1.startswith("VB") and tag2 == "RB":
        pairs.append((w1, w2))
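For the sample text, pairs would come out roughly as follows (the exact contents depend on how the tagger labels each word):
print(pairs)
# e.g. [('talking', 'weirdly'), ('run', 'fast'), ('talking', 'weirdly'), ('runs', 'fast'), ('talking', 'greatly'), ('runs', 'slow')]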
Now to your question: we'll build a dictionary of the adverbs that follow each verb. I'll store the adverbs in a set rather than a list, to eliminate duplicates:
from collections import defaultdict
consolidated = defaultdict(set)
for verb, adverb in pairs:
    consolidated[verb].add(adverb)
The defaultdict supplies an empty set for verbs that haven't been seen yet, so we don't need to check by hand.
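For the sample text this gives roughly the following (again, modulo the tagger's choices); note that 'run' and 'runs' are still separate keys, which is what the next step addresses:
print(dict(consolidated))
# e.g. {'talking': {'weirdly', 'greatly'}, 'run': {'fast'}, 'runs': {'fast', 'slow'}}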
Depending on the details of your assignment, you may also want to case-fold and lemmatize the verbs, so that the adverbs from "Driving recklessly" and "I drove carefully" are recorded together:
wnl = nltk.stem.WordNetLemmatizer()
...
for verb, adverb in pairs:
    verb = wnl.lemmatize(verb.lower(), "v")
    consolidated[verb].add(adverb)
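For example, both surface forms of "drive" map onto the same key once lemmatized with the "v" (verb) part of speech:
wnl = nltk.stem.WordNetLemmatizer()
print(wnl.lemmatize("driving", "v"))  # drive
print(wnl.lemmatize("drove", "v"))    # drive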