Search for particular parts of speech (e.g. nouns) and print them along with a preceding word

Question

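I have a text made up of a list of basic sentences, such as "she is a doctor", "he is a good person", and so on. I'm trying to write a program that returns only the nouns and the preceding pronoun (e.g. she, he, it). I need them printed as a pair, for example (she, doctor) or (he, person). I'm using SpaCy because that will let me work with similar texts in French and German as well.

This is the closest thing to what I need that I've found elsewhere on this site. What I've been trying so far is to produce a list of the nouns in the text, then search the text for the nouns in the list and print each noun together with the word 3 places before it (since that is the pattern for most of the sentences, and most is good enough for my purposes). This is what I've got for creating the list: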

def spacy_tag(text):
  text_open = codecs.open(text, encoding='latin1').read()
  parsed_text = nlp_en(text_open)
  tokens = list([(token, token.tag_) for token in parsed_text])
  list1 = []
  for token, token.tag_ in tokens:
    if token.tag_ == 'NN':
      list1.append(token)
  return(list1)

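However, when I try to do anything with it, I get an error message. I've tried using enumerate, but I couldn't get that to work either. This is my current code for searching the text for the words in the list (I haven't yet added the part that should print the word a few places beforehand, since I'm still stuck on the searching part):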

def spacy_search(text, list):
  text_open = codecs.open(text, encoding='latin1').read()
  for word in text_open:
   if word in list:
     print(word)

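The error I get is on line 4, "if word in list:", and it says "TypeError: Argument 'other' has incorrect type (expected spacy.tokens.token.Token, got str)"

Is there a more efficient way to print PRP, NN pairs using SpaCy? Alternatively, how can I amend my code so that it searches the text for the nouns in the list? (It doesn't need to be particularly elegant, just something that produces a result.)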

Answer 1

You're taking the wrong approach:

First, collect all the token attributes for the sentence:

# doc is a parsed sentence, e.g. doc = nlp("she is a doctor")
tokenized = []
for token in doc:
    tokenized.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                      token.shape_, token.is_alpha, token.is_stop, token.head,
                      token.left_edge, token.right_edge, token.ent_type_))

Write a function that receives a token and returns its head, after checking that the token's pos is 'NOUN' and its tag is 'NN':

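def get_head(token):
    # token is one of the tuples built above: index 2 is pos_, index 3 is tag_, index 8 is head
    if token[2] == 'NOUN' and token[3] == 'NN':
        return token[8]
    return None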

Now, if the returned head is a PRON, you've found what you're looking for; if not, send the head token to the function again.

You can see a worked example below:

sentences=["she is a doctor", "he is a good person"]

('she', 'she', 'PRON', 'PRP', 'nsubj', 'xxx', True, True, is, she, she, '')
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True, is, she, doctor, '')
('a', 'a', 'DET', 'DT', 'det', 'x', True, True, doctor, a, a, '')
('doctor', 'doctor', 'NOUN', 'NN', 'attr', 'xxxx', True, False, is, a, doctor, '')

So the first call returns "is" and the second call returns "she", at which point you stop.

The same applies to:

('he', 'he', 'PRON', 'PRP', 'nsubj', 'xx', True, True, is, he, he, '')
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True, is, he, person, '')
('a', 'a', 'DET', 'DT', 'det', 'x', True, True, person, a, a, '')
('good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False, person, good, good, '')
('person', 'person', 'NOUN', 'NN', 'attr', 'xxxx', True, False, is, a, person, '')

So the first call returns "is" and the second call returns "he", at which point you stop.
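
The walk described above can be sketched directly on the printed tuples. This is only an illustration under stated assumptions: the rows are copied from the output above, with the head field (index 8) stored as plain text rather than as a spaCy Token, and `find_pair` is a hypothetical helper name. Note that in that output the head of "is" is "is" itself, since it is the ROOT, so this sketch locates the pronoun by looking for the PRON row that shares the noun's head instead of following the head a second time.

```python
# Rows copied from the output above; the head (index 8) is kept as plain text here.
ROWS = [
    ("she", "she", "PRON", "PRP", "nsubj", "xxx", True, True, "is"),
    ("is", "be", "AUX", "VBZ", "ROOT", "xx", True, True, "is"),
    ("a", "a", "DET", "DT", "det", "x", True, True, "doctor"),
    ("doctor", "doctor", "NOUN", "NN", "attr", "xxxx", True, False, "is"),
]


def find_pair(rows):
    """Return (pronoun, noun) for the first NN noun whose head also governs a PRON."""
    for row in rows:
        if row[2] == "NOUN" and row[3] == "NN":
            head = row[8]
            # Look for a pronoun attached to the same head as the noun.
            for other in rows:
                if other[2] == "PRON" and other[8] == head:
                    return (other[0], row[0])
    return None


print(find_pair(ROWS))
```

Comparing heads by their text is ambiguous when a word occurs twice in a sentence; with real spaCy tokens it would be safer to compare `token.head.i` values instead.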

Answer 2

Here is a clean way to implement your intended approach.

import spacy

nlp = spacy.load("en_core_web_sm")

# put your nouns of interest here
NOUN_LIST = ["doctor", ...]

def find_stuff(text):
    doc = nlp(text)
    if len(doc) < 4:
        return None  # too short

    # look for a listed noun with a pronoun exactly three tokens before it
    for tok in doc[3:]:
        if tok.pos_ == "NOUN" and tok.text in NOUN_LIST and doc[tok.i-3].pos_ == "PRON":
            return (doc[tok.i-3].text, tok.text)
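
A fixed offset of three fits "she is a doctor" but not "he is a good person", where the pronoun sits four positions before the noun. One way to loosen this, sketched below on hand-tagged (text, pos) pairs rather than a real spaCy Doc, is to look back for the nearest pronoun within a small window. The function name `nearest_pronoun_pairs`, the `window` parameter, and the hand-written tags are assumptions for illustration; with spaCy the pairs would come from `tok.text` and `tok.pos_`.

```python
def nearest_pronoun_pairs(tagged, window=4):
    """Pair each NOUN with the closest PRON at most `window` positions before it."""
    pairs = []
    for i, (text, pos) in enumerate(tagged):
        if pos != "NOUN":
            continue
        # Walk backwards from the noun, never going past the start of the sentence.
        for j in range(i - 1, max(i - window, 0) - 1, -1):
            if tagged[j][1] == "PRON":
                pairs.append((tagged[j][0], text))
                break
    return pairs


# Hand-tagged stand-ins for what a tagger might produce (assumed tags).
s1 = [("she", "PRON"), ("is", "AUX"), ("a", "DET"), ("doctor", "NOUN")]
s2 = [("he", "PRON"), ("is", "AUX"), ("a", "DET"), ("good", "ADJ"), ("person", "NOUN")]

print(nearest_pronoun_pairs(s1))
print(nearest_pronoun_pairs(s2))
```

Because the comparison is between plain strings, it also avoids the TypeError from the question, which arises when a str is compared against a list of Token objects.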

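As the other answer mentioned, though, your approach here is wrong. What you want is the subject and the object (or predicate nominative) of the sentence. You should use the DependencyMatcher for that. Here's an example: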

from spacy.matcher import DependencyMatcher
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("she is a good person")

pattern = [
  # anchor token: verb, usually "is"
  {
    "RIGHT_ID": "verb",
    "RIGHT_ATTRS": {"POS": "AUX"}
  },
  # verb -> pronoun
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "pronoun",
    "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}
  },
  # predicate nominatives have "attr" relation
  {
    "LEFT_ID": "verb",
    "REL_OP": ">",
    "RIGHT_ID": "target",
    "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}
  }
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PREDNOM", [pattern])
matches = matcher(doc)

for match_id, (verb, pron, target) in matches:
    print(doc[pron], doc[verb], doc[target])

You can inspect the dependency relations using displacy. You can learn more about what they are in the Jurafsky and Martin book.
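
To return (pronoun, noun) pairs instead of printing the three tokens, the same pattern can be wrapped in a small function. The sketch below is one possible arrangement, not the only one: `pronoun_noun_pairs` is a hypothetical helper, and the Doc is built by hand with an assumed parse (mirroring the output shown in Answer 1) so that the example does not depend on a downloaded model. In practice you would pass `nlp(text)` and use `nlp.vocab` for the matcher.

```python
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Same pattern as above: an AUX verb with an nsubj pronoun and an attr noun.
PATTERN = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "AUX"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "pronoun",
     "RIGHT_ATTRS": {"DEP": "nsubj", "POS": "PRON"}},
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "target",
     "RIGHT_ATTRS": {"DEP": "attr", "POS": "NOUN"}},
]


def pronoun_noun_pairs(doc, matcher):
    """Return (pronoun, noun) text pairs for every match in a parsed Doc."""
    return [(doc[pron].text, doc[target].text)
            for _, (verb, pron, target) in matcher(doc)]


vocab = Vocab()
matcher = DependencyMatcher(vocab)
matcher.add("PREDNOM", [PATTERN])

# Hand-annotated parse (an assumption); heads are absolute token indices.
doc = Doc(
    vocab,
    words=["he", "is", "a", "good", "person"],
    pos=["PRON", "AUX", "DET", "ADJ", "NOUN"],
    heads=[1, 1, 4, 4, 1],
    deps=["nsubj", "ROOT", "det", "amod", "attr"],
)

pairs = pronoun_noun_pairs(doc, matcher)
print(pairs)
```

Since the pattern follows dependency relations rather than token positions, the adjective between the determiner and the noun does not affect the match.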

python

pos-tagger

spacy