Distance of Noun from Verb

Question

Is there a way to get the distance of a noun from a verb, for multiple sentences in a csv file, using NLTK and Python?

Example of sentences in the .csv file:

video shows adam stabbing the bystander.
woman quickly ran from the police after the incident.

Output:

Sentence 1: 1 (Verb is right after the noun)

Sentence 2: 2 (Verb is after another POS tag)

Answer 1

Distance between the first verb and the preceding noun

Inspired by a very similar question.

import nltk

def dist_noun_verb(text):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    last_noun_pos = None
    for pos, (word, function) in enumerate(pos_tagged):
        if function.startswith('NN'):
            last_noun_pos = pos
        elif function.startswith('VB'):
            assert(last_noun_pos is not None)
            return pos - last_noun_pos

for sentence in ['Video show Adam stabbing the bystander.', 'Woman quickly ran from the police after the incident.']:
    print(sentence)
    d = dist_noun_verb(sentence)
    print('Distance noun-verb: ', d)

Output:

Video show Adam stabbing the bystander.
Distance noun-verb:  1
Woman quickly ran from the police after the incident.
Distance noun-verb:  2
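The question asks about sentences stored in a csv file, which the code above does not cover. The following is a minimal sketch of that step, assuming one sentence per row in the first column; the name distances_from_csv and its column parameter are hypothetical, and the distance function is passed in as an argument so that dist_noun_verb (or any variant of it) can be plugged in.

```python
import csv
import io

def distances_from_csv(csv_file, dist_func, column=0):
    """Apply dist_func to the sentence found in `column` of every non-empty row."""
    results = []
    for row in csv.reader(csv_file):
        if not row or not row[column].strip():
            continue  # skip blank rows
        sentence = row[column].strip()
        results.append((sentence, dist_func(sentence)))
    return results

# In-memory stand-in for a real file opened with open(path, newline='').
sample = io.StringIO(
    "Video show Adam stabbing the bystander.\n"
    "Woman quickly ran from the police after the incident.\n"
)

# len is only a placeholder here; pass dist_noun_verb in real use.
for sentence, d in distances_from_csv(sample, dist_func=len):
    print(sentence, d)
```

Note that a sentence containing a comma must be quoted in the csv file, otherwise csv.reader splits it across several columns.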

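Note that function.startswith('VB') detects the first verb in the sentence. If you want to single out the main verb or other kinds of verbs, you need to check the specific verb tags assigned by nltk.pos_tag: 'VBP', 'VBD', etc.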
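As a rough illustration of checking specific verb tags, the sketch below groups the Penn Treebank verb tags into finite and non-finite forms and filters a tagged sentence by them. The helper name verbs_with_tags and the two tag sets are assumptions made for this example, and the tagged sentence is written by hand rather than taken from actual tagger output.

```python
# Penn Treebank verb tags, the tagset used by nltk.pos_tag for English.
FINITE_VERB_TAGS = {'VBD', 'VBP', 'VBZ'}     # past, present non-3rd-sg, present 3rd-sg
NON_FINITE_VERB_TAGS = {'VB', 'VBG', 'VBN'}  # base form, gerund/participle, past participle

def verbs_with_tags(pos_tagged, tags):
    """Return (position, word, tag) for every token whose tag is in `tags`."""
    return [(pos, word, tag)
            for pos, (word, tag) in enumerate(pos_tagged)
            if tag in tags]

# Hand-written tags for illustration only (not actual tagger output).
pos_tagged = [('Woman', 'NN'), ('quickly', 'RB'), ('ran', 'VBD'),
              ('from', 'IN'), ('the', 'DT'), ('police', 'NN'),
              ('after', 'IN'), ('the', 'DT'), ('incident', 'NN'), ('.', '.')]

print(verbs_with_tags(pos_tagged, FINITE_VERB_TAGS))
# [(2, 'ran', 'VBD')]
print(verbs_with_tags(pos_tagged, NON_FINITE_VERB_TAGS))
# []
```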

Also, the line assert(last_noun_pos is not None) in my code means the code will crash if the first verb appears before any noun. You may want to handle that case differently.
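One possible way to handle that case is to return None instead of asserting. The sketch below is a variant under that assumption; the name dist_noun_verb_tagged is hypothetical, and it takes an already tagged list of (word, tag) pairs, such as the result of nltk.pos_tag, so the logic can be tried without running the tagger.

```python
def dist_noun_verb_tagged(pos_tagged):
    """Distance from the first verb back to the closest preceding noun.

    Returns None when there is no verb, or when the first verb
    has no noun before it.
    """
    last_noun_pos = None
    for pos, (word, tag) in enumerate(pos_tagged):
        if tag.startswith('NN'):
            last_noun_pos = pos
        elif tag.startswith('VB'):
            if last_noun_pos is None:
                return None  # verb found before any noun
            return pos - last_noun_pos
    return None  # no verb in the sentence

# Hand-written tags for illustration only (not actual tagger output).
print(dist_noun_verb_tagged([('Woman', 'NN'), ('quickly', 'RB'), ('ran', 'VBD')]))
# 2
print(dist_noun_verb_tagged([('Run', 'VB'), ('home', 'NN')]))
# None
```

Whether None, an exception, or a search for the next verb is the right behavior depends on what the distances are used for.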

Interestingly, if I add an 's' to 'show' to form the sentence 'Video shows Adam stabbing the bystander.', then nltk tags 'shows' as a noun rather than a verb.

Going further: distance between the "main" verb and the preceding noun

Consider the sentence:

'The umbrella that I used to protect myself from the rain was red.'

This sentence contains three verbs: 'used', 'protect', 'was'. Using nltk.word_tokenize followed by nltk.pos_tag, as I did above, correctly identifies these three verbs:

text = 'The umbrella that I used to protect myself from the rain was red.'
tokens = nltk.word_tokenize(text)
pos_tagged = nltk.pos_tag(tokens)
print(pos_tagged)
# [('The', 'DT'), ('umbrella', 'NN'), ('that', 'IN'), ('I', 'PRP'), ('used', 'VBD'), ('to', 'TO'), ('protect', 'VB'), ('myself', 'PRP'), ('from', 'IN'), ('the', 'DT'), ('rain', 'NN'), ('was', 'VBD'), ('red', 'JJ'), ('.', '.')]
print([(w,f) for w,f in pos_tagged if f.startswith('VB')])
# [('used', 'VBD'), ('protect', 'VB'), ('was', 'VBD')]

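However, the main verb of the sentence is 'was'; the other two verbs are part of the noun group that forms the subject of the sentence, 'The umbrella that I used to protect myself from the rain'.

So we might want to write a function dist_subject_verb that returns the distance between the subject and the main verb 'was', rather than the distance between the first verb 'used' and the preceding noun.

One way to identify the main verb is to parse the sentence into a tree and ignore the verbs located in subtrees, considering only the verbs that are direct children of the root.

The sentence should be parsed as: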

((The umbrella) (that (I used) to (protect (myself) (from (the rain))))) (was) (red)

Now we can easily ignore 'used' and 'protect', which sit deep inside a subtree, and consider only the main verb 'was'.

Parsing a sentence into a tree is a more complex operation than simply tokenizing it.
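To illustrate the idea of keeping only root-level verbs, here is a minimal sketch that assumes a parse tree is already available. The tree is written by hand as nested tuples (it is not the output of any parser), its labels are an assumption loosely following the bracketing above, and the helper names is_leaf, leaves and root_level_verbs are hypothetical.

```python
# Hand-written parse (an assumption, not parser output).
# An inner node is (label, [children]); a leaf is (POS tag, word).
tree = ('S', [
    ('NP', [
        ('NP', [('DT', 'The'), ('NN', 'umbrella')]),
        ('SBAR', [
            ('IN', 'that'),
            ('S', [
                ('NP', [('PRP', 'I')]),
                ('VP', [
                    ('VBD', 'used'), ('TO', 'to'),
                    ('VP', [
                        ('VB', 'protect'),
                        ('NP', [('PRP', 'myself')]),
                        ('PP', [('IN', 'from'),
                                ('NP', [('DT', 'the'), ('NN', 'rain')])]),
                    ]),
                ]),
            ]),
        ]),
    ]),
    ('VBD', 'was'),
    ('JJ', 'red'),
])

def is_leaf(node):
    # A leaf carries a word (str) in second position, an inner node a list.
    return isinstance(node[1], str)

def leaves(node):
    """Return the (tag, word) leaves under node, left to right."""
    if is_leaf(node):
        return [node]
    found = []
    for child in node[1]:
        found.extend(leaves(child))
    return found

def root_level_verbs(tree):
    """Return (token position, word) for verbs that are direct children of the root."""
    verbs = []
    position = 0
    for child in tree[1]:
        if is_leaf(child):
            if child[0].startswith('VB'):
                verbs.append((position, child[1]))
            position += 1
        else:
            position += len(leaves(child))  # skip the whole subtree
    return verbs

print(root_level_verbs(tree))
# [(11, 'was')]
```

The verbs 'used' and 'protect' are skipped because they only occur inside the subject subtree; producing such a tree automatically from raw text is the harder part and is not shown here.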

There is a similar question that deals with parsing a sentence into a tree.
