Replace entities with their entity labels using spaCy

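I want to process my data with spaCy by replacing each entity with its label. I have 3,000 lines of text in which every entity should be replaced by its entity label.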

For example:

"Georgia recently became the first U.S. state to "ban Muslim culture."

should become:

"GPE recently became the ORDINAL GPE state to "ban NORP culture. "

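I am looking for code that does this replacement across many lines of text.

Thanks a lot.

For example, the snippets below do this, but only for a single sentence. I want to change s (a string) into a column containing 3,000 rows.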

First example, from ():

import spacy

# model name assumed here; the original snippet did not show which model it loaded
nlp = spacy.load("en_core_web_sm")
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
# reversed so that substituting one entity does not shift the offsets of the others
for e in reversed(doc.ents):
    start = e.start_char
    end = e.end_char
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
#His friend PERSON is here with PERSON and PERSON.
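To see why the loop runs in reverse, here is a minimal sketch that needs no spaCy at all. The sentence and the `(start, end, label)` tuples are hand-written stand-ins for what `doc.ents` would provide, and `substitute` is a hypothetical helper name. Because a label rarely has the same length as the text it replaces, replacing left to right invalidates the offsets of every later entity:

```python
# Hand-written entity spans as (start_char, end_char, label).
# In real use these values would come from doc.ents.
text = "Bart met Fred."
spans = [(0, 4, "PERSON"), (9, 13, "PERSON")]

def substitute(text, spans):
    """Replace each (start, end, label) span, in the order given."""
    for start, end, label in spans:
        text = text[:start] + label + text[end:]
    return text

# Left to right: the first replacement lengthens the string,
# so the second span's offsets now point at the wrong characters.
forward = substitute(text, spans)

# Right to left: everything before the current span is still untouched,
# so its offsets remain valid.
backward = substitute(text, list(reversed(spans)))

print(forward)   # PERSON mePERSONed.
print(backward)  # PERSON met PERSON.
```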

Second example, from ():

import spacy

nlp = spacy.load("en_core_web_sm")
s = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(s)

def replaceSubstring(s, replacement, position, length_of_replaced):
    s = s[:position] + replacement + s[position+length_of_replaced:]
    return(s)

for ent in reversed(doc.ents):
    #print(ent.text, ent.start_char, ent.end_char, ent.label_)
    replacement = "<{}>{}</{}>".format(ent.label_,ent.text, ent.label_)
    position = ent.start_char
    length_of_replaced = ent.end_char - ent.start_char 
    s = replaceSubstring(s, replacement, position, length_of_replaced)

print(s)
#<ORG>Apple</ORG> is looking at buying <GPE>U.K.</GPE> startup for <MONEY>$1 billion</MONEY>
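Either snippet can be turned into a function and mapped over many rows. The following is only a sketch of that idea: `label_entities` and `label_all` are hypothetical helper names, and `nlp` is assumed to be an already loaded spaCy pipeline passed in by the caller. It relies on `doc.ents`, `start_char`, `end_char`, `label_` and `nlp.pipe`, which processes texts in batches and is usually faster than calling `nlp` once per row.

```python
def label_entities(doc):
    """Return the text of doc with every entity span replaced by its label."""
    text = doc.text
    # Work from right to left so that earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        text = text[:ent.start_char] + ent.label_ + text[ent.end_char:]
    return text


def label_all(nlp, texts):
    """Run a loaded spaCy pipeline over many texts and label each one."""
    return [label_entities(doc) for doc in nlp.pipe(texts)]


# Usage, assuming a pandas DataFrame `df` whose text column is named "text"
# (both names are placeholders for your own data):
#
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   df["labeled"] = label_all(nlp, df["text"].astype(str).tolist())
```

Because the replacement works on whole entity spans, a multi-word name such as "Nicolas J. Smith" should come out as a single PERSON rather than one label per token.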

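IIUC, you can achieve what you want by:

  1. Reading your texts from a file, one text per line
  2. Processing them by substituting entities, if any, with their tags
  3. Writing the results to disk, one text per line

Demo: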

import spacy
nlp = spacy.load("en_core_web_md")

#read txt file, each string on its own line
with open("./try.txt","r") as f:
    texts = f.read().splitlines()

#substitute entities with their TAGS
docs = nlp.pipe(texts)
out = []
for doc in docs:
    out_ = ""
    for tok in doc:
        text = tok.text
        if tok.ent_type_:
            text = tok.ent_type_
        out_ += text + tok.whitespace_
    out.append(out_)

# write to file
with open("./out_try.txt","w") as f:
    f.write("\n".join(out))

Contents of the input file:

Georgia recently became the first U.S. state to "ban Muslim culture.
His friend Nicolas J. Smith is here with Bart Simpon and Fred.
Apple is looking at buying U.K. startup for $1 billion

Contents of the output file:

GPE recently became the ORDINAL GPE state to "ban NORP culture.
His friend PERSON PERSON PERSON is here with PERSON PERSON and PERSON.
ORG is looking at buying GPE startup for MONEYMONEY MONEY

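Note the MONEYMONEY pattern.

It appears because the replacement is done token by token, and there is no whitespace between the `$` and `1` tokens: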

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for tok in doc:
    print(f"{tok.text}, {tok.ent_type_}, whitespace='{tok.whitespace_}'")

Apple, ORG, whitespace=' '
is, , whitespace=' '
looking, , whitespace=' '
at, , whitespace=' '
buying, , whitespace=' '
U.K., GPE, whitespace=' '
startup, , whitespace=' '
for, , whitespace=' '
$, MONEY, whitespace='' # <-- no whitespace between $ and 1
1, MONEY, whitespace=' '
billion, MONEY, whitespace=''
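If one label per entity is wanted instead of one label per token, a possible variation of the token loop is sketched below. It is not part of the demo above: `label_tokens` is a hypothetical helper name, and it relies on spaCy's `Token.ent_iob_` attribute, which is "B" for the first token of an entity and "I" for the tokens that continue it.

```python
def label_tokens(doc):
    """Rebuild the text, writing each entity's label once per entity."""
    tokens = list(doc)
    pieces = []
    for i, tok in enumerate(tokens):
        if tok.ent_iob_ == "B":
            # First token of an entity: emit the label once.
            pieces.append(tok.ent_type_)
        elif tok.ent_iob_ != "I":
            # Token outside any entity: keep its text unchanged.
            pieces.append(tok.text)
        # Tokens tagged "I" continue an entity and contribute no text.

        # Hold back the whitespace while the same entity continues,
        # so only the whitespace after its last token is kept.
        continues = i + 1 < len(tokens) and tokens[i + 1].ent_iob_ == "I"
        if not continues:
            pieces.append(tok.whitespace_)
    return "".join(pieces)
```

In the demo loop this would replace the inner `for tok in doc` block with `out.append(label_tokens(doc))`. With the tokens printed above, the tail of the sentence would then read `for MONEY` instead of `for MONEYMONEY MONEY`.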