Process input data to a correct format for a custom NER BERT model

I want to train a custom NER BERT model, so I need to bring my input data into the right format first.

My df_input looks like this:

df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1' : ['doc', 'document'],
                         'KeyWord2' : ['12', '13'],
                         'KeyWord3' : ['ab', 'xx']
                        })
DocumentText           KeyWord1    KeyWord2    KeyWord3
This is a doc 12 ab    doc         12          ab
document 13 a xx       document    13          xx
...

All text in the DocumentText column should be tokenized. Every token should then receive the tag O, and every token that matches one of the KeyWord columns should receive a tag corresponding to that column name.

This is what it should look like:

Word       DocNr      Tag
This       1          O
is         1          O
a          1          O
doc        1          KeyWord1
12         1          KeyWord2
ab         1          KeyWord3
document   2          KeyWord1
13         2          KeyWord2
a          2          O
xx         2          KeyWord3
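
One way to sketch this transformation (assuming plain whitespace tokenization is sufficient, unlike the nltk.word_tokenize used below) is to melt the KeyWord columns into long form and left-merge them against the exploded words, so keywords are matched per document:

```python
import pandas as pd

df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1': ['doc', 'document'],
                         'KeyWord2': ['12', '13'],
                         'KeyWord3': ['ab', 'xx']})

# Explode each document into one row per word, keeping the document number
words = (df_input['DocumentText'].str.split()
         .explode()
         .reset_index()
         .rename(columns={'index': 'DocNr', 'DocumentText': 'Word'}))
words['DocNr'] += 1

# Melt the keyword columns into (DocNr, Tag, Word) triples for merging
keys = (df_input.filter(like='KeyWord')
        .reset_index()
        .melt(id_vars='index', var_name='Tag', value_name='Word')
        .rename(columns={'index': 'DocNr'}))
keys['DocNr'] += 1

# Left-merge: words without a keyword match in their own document get tag 'O'
out = words.merge(keys, on=['DocNr', 'Word'], how='left').fillna({'Tag': 'O'})
```

Because the merge keys include DocNr, a keyword from one document is never applied to a word in another.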

My code works, but it is very slow; it takes many hours. After the for loop, I tried the apply method with a lambda function, but I got stuck there because it returns a Series object with one DataFrame per document in each row.

def preprocess(doctext, docnr, keyword1, keyword2, keyword3):
    df1 = pd.DataFrame(columns = ['Word'])
    df1['Word'] = nltk.word_tokenize(str(doctext))
    df1['DocNR'] = docnr
    df1['Tag'] = 'O'
    df1.loc[df1['Word'] == keyword1, 'Tag'] = 'KeyWord1'
    df1.loc[df1['Word'] == keyword2, 'Tag'] = 'KeyWord2'
    df1.loc[df1['Word'] == keyword3, 'Tag'] = 'KeyWord3'
    return df1

df = pd.DataFrame()
for i in range(0, 50000):
    try:
        df = df.append(preprocess(df_input['DocumentText'][i],
                                  i + 1,
                                  df_input['KeyWord1'][i],
                                  df_input['KeyWord2'][i],
                                  df_input['KeyWord3'][i]),
                       ignore_index=True)
    except KeyError:
        break

pd.DataFrame(df_input.apply(lambda row: preprocess(row['DocumentText'], 
                                        row.name,
                                        row['KeyWord1'],
                                        row['KeyWord2'],
                                        row['KeyWord3']),
                                        axis=1))[0]

Is there another way to get to this result quickly?

This should be fast:

import numpy as np

e = df_input.assign(DocumentText=df_input['DocumentText'].str.split(r'\s+')).explode('DocumentText')
keywords = e.filter(like='KeyWord')
col_idxes = np.sum((e['DocumentText'].to_numpy()[:, None] == keywords.to_numpy())
                   * np.arange(1, keywords.shape[1] + 1), axis=1)
tags = np.array(['O', *keywords.columns])[col_idxes]
out = e[['DocumentText']].assign(DocNr=e.index + 1, Tag=tags).reset_index(drop=True)

Output:

>>> out
  DocumentText  DocNr       Tag
0         This      1         O
1           is      1         O
2            a      1         O
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         O
9           xx      2  KeyWord3
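
The key step above is the broadcast comparison: comparing the words as an (n, 1) column against the (n, 3) keyword block gives a boolean matrix, and weighting its columns with 1..3 before summing each row yields the index of the matching keyword column (0 when nothing matches). A small standalone demo of that trick, assuming the same shapes:

```python
import numpy as np

words = np.array(['This', 'doc', '12'])
keywords = np.array([['doc', '12', 'ab'],
                     ['doc', '12', 'ab'],
                     ['doc', '12', 'ab']])

# (3,1) vs (3,3) broadcast: True where the word equals that row's keyword
matches = words[:, None] == keywords

# Weight columns with 1..3 and sum each row: 0 = no match, k = column k matched
col_idx = np.sum(matches * np.arange(1, 4), axis=1)

# Index into ['O', KeyWord1, KeyWord2, KeyWord3] to get the tag per word
tags = np.array(['O', 'KeyWord1', 'KeyWord2', 'KeyWord3'])[col_idx]
```

This relies on at most one keyword matching per row; if two columns matched, the weighted sum would point at a wrong index.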

You can try tokenizing each entry with df.apply first and then matching the words against the keywords:

import pandas as pd
import nltk
nltk.download('punkt')
df1 = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                    'KeyWord1': ['doc', 'document'],
                    'KeyWord2': ['12', '13'],
                    'KeyWord3': ['ab', 'xx']
                   })

df = pd.DataFrame(data={'word': df1.DocumentText.apply(nltk.word_tokenize)})
df.index += 1
df = df.explode('word')
df = df.rename_axis('DocNr')
df['Tag'] = 'O'
df.loc[df['word'].isin(df1['KeyWord1'].to_numpy()), 'Tag'] = 'KeyWord1'
df.loc[df['word'].isin(df1['KeyWord2'].to_numpy()), 'Tag'] = 'KeyWord2'
df.loc[df['word'].isin(df1['KeyWord3'].to_numpy()), 'Tag'] = 'KeyWord3'
           word       Tag
DocNr                    
1          This         O
1            is         O
1             a         O
1           doc  KeyWord1
1            12  KeyWord2
1            ab  KeyWord3
2      document  KeyWord1
2            13  KeyWord2
2             a         O
2            xx  KeyWord3
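
One caveat with the `isin` approach: each word is checked against the keywords of all documents at once, so a word that happens to equal another document's keyword would be tagged too. The example data never triggers this, but real documents might. A small hypothetical demonstration:

```python
import pandas as pd

# Hypothetical two-document example: 'doc' is doc 1's keyword,
# but it also appears as a plain word in doc 2
demo = pd.DataFrame({'DocumentText': ['This is a doc', 'a doc here'],
                     'KeyWord1': ['doc', 'here']})

tokens = pd.DataFrame({'word': demo.DocumentText.str.split()}).explode('word')
tokens['Tag'] = 'O'
tokens.loc[tokens['word'].isin(demo['KeyWord1']), 'Tag'] = 'KeyWord1'
# Doc 2's 'doc' gets tagged KeyWord1 even though doc 2's own keyword is 'here'
```

If keywords must only apply within their own document, match on the document number as well (e.g. via a merge keyed on both).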

Use df.explode and Series.map for better performance:

In [736]: df_input.DocumentText = df_input.DocumentText.str.split()

In [713]: x = df_input.explode('DocumentText')
In [715]: y = x.iloc[:, 1:].drop_duplicates()

In [716]: d = {i:y[i].values.tolist() for i in y.columns}

In [733]: d = {i:k for k,v in d.items() for i in v}

In [724]: x['tags'] = x.DocumentText.map(d).fillna('O')
In [726]: x['DocNr'] = x.index + 1
In [730]: res = x[['DocumentText', 'DocNr', 'tags']].reset_index(drop=True)

In [731]: res
Out[731]: 
  DocumentText  DocNr      tags
0         This      1         O
1           is      1         O
2            a      1         O
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         O
9           xx      2  KeyWord3
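
For clarity, the two dict comprehensions above first group each tag's keyword values and then invert that into a word-to-tag lookup that Series.map can use. A minimal trace with the example data:

```python
# What the first comprehension builds from the deduplicated keyword columns
tag_to_words = {'KeyWord1': ['doc', 'document'],
                'KeyWord2': ['12', '13'],
                'KeyWord3': ['ab', 'xx']}

# The second comprehension inverts it into a word -> tag lookup;
# words absent from this dict map to NaN, which fillna then replaces
d = {word: tag for tag, words in tag_to_words.items() for word in words}
```

This has the same cross-document caveat as the `isin` answer: the lookup is global, not keyed by document number.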