Convert .CSV data into CoNLL BIO format for NER

Question

I have some data in a .csv file that looks like this:

import pandas as pd

sent_num = [0, 1, 2]
text = [['Jack', 'in', 'the', 'box'], ['Jack', 'in', 'the', 'box'], ['Jack', 'in', 'the', 'box']]
tags = [['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG'], ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG'], ['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']]

df = pd.DataFrame(zip(sent_num, text, tags), columns=['sent_num', 'text', 'tags'])
df

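I want to convert this data into a CoNLL-format text file like the one below, where the columns (text and tags) are separated by a tab and the end of each sentence (or document) is marked by a blank line.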

text    tags
Jack    B-ORG
in  I-ORG
the I-ORG
box I-ORG

Jack    B-ORG
in  I-ORG
the I-ORG
box I-ORG

Jack    B-ORG
in  I-ORG
the I-ORG
box I-ORG

I tried the following, but it did not work: the blank rows are counted as valid data instead of marking the end of a sentence.

# create a three-column dataset
DF = df.apply(pd.Series.explode)
DF.head()

# insert space between rows in the data frame
# find the indices where changes occur 
switch = DF['sent_num'].ne(DF['sent_num'].shift(-1))

# construct a new empty dataframe and shift its index by .1
DF1 = pd.DataFrame('', index=switch.index[switch] + .1, columns=DF.columns)

# concatenate old and new dataframes and sort by index, reset index and remove row positions by iloc
DF2 = pd.concat([DF, DF1]).sort_index().reset_index(drop=True).iloc[:-1]
DF2.head()

# group by tags
DF2[['text', 'tags']].groupby('tags').count()

I am looking for help fixing or improving my code.

Answer 1

with open("output.txt", "w") as f_out:
    # header row
    print("text\ttags", file=f_out)
    for _, line in df.iterrows():
        # one "token<TAB>tag" line per token
        for txt, tag in zip(line["text"], line["tags"]):
            print("{}\t{}".format(txt, tag), file=f_out)
        # a blank line marks the end of the sentence
        print(file=f_out)

This creates output.txt:

text    tags
Jack    B-ORG
in  I-ORG
the I-ORG
box I-ORG

Jack    B-ORG
in  I-ORG
the I-ORG
box I-ORG

Jack    B-ORG
in  I-ORG
the I-ORG
box I-ORG
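
One thing to watch for if the DataFrame is loaded from the .csv rather than built in memory: list columns are usually stored as strings such as "['Jack', 'in', 'the', 'box']", so zipping them would pair characters, not tokens. The sketch below assumes that storage format and parses the columns with ast.literal_eval while reading. The helper name to_conll and the inline csv_text sample are made up for illustration and are not part of the question.

```python
import ast
import io

import pandas as pd


def to_conll(df, text_col="text", tags_col="tags", header=True):
    """Return df as a CoNLL-style string (hypothetical helper, not a library API)."""
    blocks = []
    for tokens, tags in zip(df[text_col], df[tags_col]):
        if len(tokens) != len(tags):
            raise ValueError("tokens and tags differ in length")
        # One "token<TAB>tag" line per token.
        blocks.append("\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags)))
    # Every sentence is followed by a blank line, as in output.txt above.
    body = "".join(block + "\n\n" for block in blocks)
    if header:
        body = f"{text_col}\t{tags_col}\n" + body
    return body


# Assumed on-disk layout: each list column saved as its string representation.
csv_text = (
    "sent_num,text,tags\n"
    "0,\"['Jack', 'in', 'the', 'box']\",\"['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']\"\n"
    "1,\"['Jack', 'in', 'the', 'box']\",\"['B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']\"\n"
)

# converters turns the stored strings back into real Python lists.
df_csv = pd.read_csv(
    io.StringIO(csv_text),
    converters={"text": ast.literal_eval, "tags": ast.literal_eval},
)

conll_text = to_conll(df_csv)
print(conll_text)
```

Building the whole string first makes the result easy to inspect or test before writing it out with open(..., "w"). If the training tool does not expect a header row, pass header=False.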
