Repeated entity labels when replacing entities with their labels using spaCy

Code:

import spacy
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

#substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt","w") as f:
    f.write("\n".join(out))

Contents of the input file:

Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion

Contents of the output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY

I need to avoid this repetition in the sentences above. For example, in sentence 2, 'PERSON PERSON PERSON' should become a single entity label, PERSON. Thanks.

Let's try:

import spacy
# spaCy v2 import path; in spaCy v3 the equivalent function is
# spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

docs = nlp.pipe(texts)
out_text = ""
for doc in docs:
    # character offsets and label of every entity span
    offsets = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    # one BILUO tag per token, e.g. B-PERSON, I-PERSON, L-PERSON, U-GPE, O
    tags = biluo_tags_from_offsets(doc, offsets)
    out = []
    for tok, tag in zip(doc, tags):
        prefix = tag.split("-")[0]
        if prefix == "O":
            # not part of an entity: keep the token unchanged
            out.append(tok.text + tok.whitespace_)
        elif prefix in ("U", "L"):
            # single-token entity (U) or last token of a multi-token entity (L):
            # emit the label once; B and I tokens are skipped
            out.append(tok.ent_type_ + tok.whitespace_)
    out_text += "".join(out) + "\n"

with open("out_try.txt","w") as f:
    f.write(out_text)

Contents of the output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON is here with PERSON and PERSON.
ORG is looking at buying GPE startup for MONEY
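
A possible alternative, as a rough sketch: `doc.ents` already yields whole entity spans (each with `start_char`, `end_char`, and `label_`), so the labels can be spliced into the original string by character offset, without going through BILUO tags. The helper name `replace_spans` is hypothetical, and the spans below are built by hand as a stand-in for what `doc.ents` would provide, so the snippet runs without loading a model.

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label.

    Assumes the spans do not overlap, which holds for spaCy's doc.ents.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])  # text before the entity, unchanged
        out.append(label)               # the whole entity becomes one label
        cursor = end
    out.append(text[cursor:])           # text after the last entity
    return "".join(out)


sentence = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."

# Hand-built stand-in for:
#   [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
names = ["Nicolas J. Smith", "Bart Simpon", "Fred"]
spans = [(sentence.index(n), sentence.index(n) + len(n), "PERSON") for n in names]

result = replace_spans(sentence, spans)
print(result)
```

With a real pipeline, the same helper would presumably be called as `replace_spans(doc.text, [(e.start_char, e.end_char, e.label_) for e in doc.ents])`. Because it only relies on `doc.ents`, it should not depend on where the BILUO helper lives in a given spaCy version.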