Input CSV to Custom NER Model in SpaCy

Question

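I'm very new to ML and Python, so I appreciate any help with this issue. I've trained an NER model using Prodigy (based on en_core_web_lg) and saved the model to my virtual environment.

I'm using conda and VS Code on Windows 10 with a spaCy 2.x environment, and I'm now trying to load a comma-separated CSV file. This is my model setup so far: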

import spacy

nlp = spacy.load("en_core_web_lg", disable=["ner"])  # load the base model without its NER
print(nlp.pipe_names)  # check that "ner" was removed
nlp_entity = spacy.load("tmp_model", vocab=nlp.vocab)  # load my Prodigy-trained model
nlp.add_pipe(nlp_entity.get_pipe("ner"))  # add my NER component to the base pipeline
print(nlp.pipe_names)  # check that "ner" was added back
nlp.to_disk("./tmp_model2")  # save the combined pipeline under a new name

nlp = spacy.load("tmp_model2")  # load the new model
doc = nlp("Paragraph Text Here")  # test the model on some sample text
print(doc.text)
for ent in doc.ents:  # for every entity in the doc
    print(ent.label_, ent.text)  # print the label and text

This is where I'm stuck. I figured I could read in the CSV file like this:

input = pd.read_csv('myfile.csv') #read in CSV via Pandas
doc=nlp(input['Text']) #look for "Text" column in the CSV file and run the model for each row
for ent in doc.ents:
     print(ent.label_, ent.text)

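TypeError: Argument 'string' has incorrect type (expected str, got Series)

I'm also very new to Python, but I think I need to convert the Pandas DataFrame column to a string? If so, how would I do that?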

Answer 1

You're right: nlp takes a single string as input, not a whole Pandas Series.

If you want to run it on one paragraph, you can do this:

doc=nlp(input['Text'].values[0])

where 0 is the index of the paragraph (row).
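To process every row instead of a single one, one option is spaCy's nlp.pipe, which takes an iterable of strings and yields one Doc per string in the same order. Below is a minimal sketch, not tested against the asker's model. The helper name entities_per_row is made up for illustration, and nlp is assumed to be the pipeline loaded in the question.

```python
def entities_per_row(nlp, texts):
    """Return one list of (label, text) pairs for each input string."""
    results = []
    # nlp.pipe consumes an iterable of strings and yields Docs in input order
    for doc in nlp.pipe(texts):
        results.append([(ent.label_, ent.text) for ent in doc.ents])
    return results

# Hypothetical usage with the question's objects (not run here):
#   df = pd.read_csv("myfile.csv")
#   rows = entities_per_row(nlp, df["Text"].astype(str))
#   for i, ents in enumerate(rows):
#       print(i, ents)
```

The astype(str) call in the usage comment is there because empty CSV cells are read as NaN (a float), which nlp would reject with a similar TypeError.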

Answer 2

With the help of Andrey's post, I was able to work out the right syntax to print out the entities for all rows.

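import pandas as pd

# nlp is the combined pipeline loaded from "tmp_model2" in the question
df = pd.read_csv('MyFile.csv')  # named df to avoid shadowing the built-in input()
row_nums = len(df.index)
print("Number of rows is: ", row_nums)
for x in range(row_nums):
    print(x, " LOOP START")
    doc = nlp(df['Text'].values[x])  # nlp expects a single str, so pass one row at a time
    print(doc.text)
    for ent in doc.ents:  # was doc2.ents, which is undefined here
        print(ent.label_, ent.text)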

The next step is for me to figure out how to push this back into a CSV file!
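For writing the results back out, here is one possible sketch. It uses only the standard-library csv module rather than Pandas, so it is a different route from the code above. The function name annotate_csv, the "Entities" column name, and the "LABEL=text" cell format are all assumptions chosen for illustration. With Pandas the same idea would be to assign a new column and call to_csv.

```python
import csv

def annotate_csv(nlp, src, dst, text_col="Text", out_col="Entities"):
    """Copy CSV rows from src to dst, adding a column of entities found in text_col."""
    reader = csv.DictReader(src)
    rows = list(reader)
    fieldnames = list(reader.fieldnames or []) + [out_col]
    # Run the model over the whole column; nlp.pipe yields Docs in input order
    docs = nlp.pipe(row[text_col] for row in rows)
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row, doc in zip(rows, docs):
        # Flatten all entities of a row into one cell, e.g. "ORG=Acme; GPE=Paris"
        row[out_col] = "; ".join(f"{ent.label_}={ent.text}" for ent in doc.ents)
        writer.writerow(row)

# Hypothetical usage (file names are placeholders, not run here):
#   with open("MyFile.csv", newline="", encoding="utf-8") as src, \
#        open("MyFile_entities.csv", "w", newline="", encoding="utf-8") as dst:
#       annotate_csv(nlp, src, dst)
```

Keeping all entities of a row in a single cell preserves the one-row-per-paragraph layout of the input. If one row per entity is more useful, the inner loop could write a separate row for each ent instead.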

python

machine-learning

spacy