Python: how to assign output from spaCy to a list of tuples and then convert it to a DataFrame?

I am trying to assign the printed output of a for loop to the variable parsed_generics.

Here is the code and its printed output:

import spacy

nlp = spacy.load("en")
doc = nlp(generics)
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Aerobics Aerobics nsubj is
a form form attr is
physical exercise exercise pobj of
rhythmic aerobic exercise exercise dobj combines
stretching and strength training routines routines pobj with
the goal goal pobj with
all elements elements dobj improving
...

To assign it to a variable, this is what I wrote:

nlp = spacy.load("en")
doc = nlp(generics)
for chunk in doc.noun_chunks:
    parsed_generics = (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

But when I inspect parsed_generics, what I get is:

('predators', 'predators', 'pobj', 'of')

I was expecting a list of tuples:

[('Aerobics', 'Aerobics', 'nsubj', 'is'), ('a form', 'form', 'attr', 'is'), ('physical exercise', 'exercise', 'pobj', 'of'), ...]

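I think I have to set up an empty list above my for loop, iterate over doc, and append to that list, but append only takes 1 argument and I have 4 (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text).

I eventually want to store this in a DataFrame.

Any comments or suggestions would be appreciated. Thanks in advance.

You need to use append. You are overwriting parsed_generics on every iteration, which means that what you see afterwards is only the last row of the loop.

Append each iteration's tuple to a list, then inspect the list once the loop has finished: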

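import spacy

result = []  # collects one tuple per noun chunk

# "en" is the shortcut name used in the question; on spaCy v3+ load a full package name such as "en_core_web_sm"
nlp = spacy.load("en")
doc = nlp(generics)  # generics is the text from the question
for chunk in doc.noun_chunks:
    # append() takes a single argument, so wrap the four values in one tuple
    result.append((chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text))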
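For the last step, getting the list of tuples into a DataFrame, here is a minimal sketch assuming pandas is installed. To keep it runnable without a spaCy model, the rows are hard-coded from the printed output in the question; in practice you would pass the list built by the loop instead. The column names are my own placeholders, not anything spaCy defines, so rename them as you see fit.

```python
import pandas as pd

# Sample rows copied from the printed output in the question; in practice
# this would be the list of tuples collected in the loop.
result = [
    ("Aerobics", "Aerobics", "nsubj", "is"),
    ("a form", "form", "attr", "is"),
    ("physical exercise", "exercise", "pobj", "of"),
]

# Hypothetical column names chosen for illustration only.
columns = ["chunk_text", "root_text", "root_dep", "root_head_text"]

# Each tuple becomes one row; tuple positions map to the columns in order.
df = pd.DataFrame(result, columns=columns)
print(df)
```

Passing the list straight to the DataFrame constructor works because every tuple has the same length, so each position lines up with one column.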