Process input data to a correct format for a custom NER BERT model
I want to train a custom NER BERT model, so I need to process my input data into a specific format.
My df_input looks like this:
df_input = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                         'KeyWord1': ['doc', 'document'],
                         'KeyWord2': ['12', '13'],
                         'KeyWord3': ['ab', 'xx']
                         })
DocumentText         KeyWord1  KeyWord2  KeyWord3
This is a doc 12 ab  doc       12        ab
document 13 a xx     document  13        xx
....
All text in the DocumentText column should be tokenized. Every token should then receive the tag O, and every token that matches one of the KeyWord columns should receive the tag corresponding to that column name.
This is what it should look like:
Word      DocNr  Tag
This      1      O
is        1      O
a         1      O
doc       1      KeyWord1
12        1      KeyWord2
ab        1      KeyWord3
document  2      KeyWord1
13        2      KeyWord2
a         2      O
xx        2      KeyWord3
My code works, but it is very slow; it takes many hours. After trying a for loop, I moved on to the apply method with a lambda function, but I got stuck there because it returns a Series object holding one DataFrame per document (one per row).
def preprocess(doctext, docnr, keyword1, keyword2, keyword3):
    df1 = pd.DataFrame(columns=['Word'])
    df1['Word'] = nltk.word_tokenize(str(doctext))
    df1['DocNR'] = docnr
    df1['Tag'] = 'O'
    # .loc avoids the chained-assignment warning of df1['Tag'][...] = ...
    df1.loc[df1['Word'] == keyword1, 'Tag'] = 'KeyWord1'
    df1.loc[df1['Word'] == keyword2, 'Tag'] = 'KeyWord2'
    df1.loc[df1['Word'] == keyword3, 'Tag'] = 'KeyWord3'
    return df1
# DataFrame.append is deprecated in recent pandas, and growing a frame row by row is slow
for i in range(0, 50000):
    try:
        df = df.append(preprocess(df_input['DocumentText'][i],
                                  i + 1,
                                  df_input['KeyWord1'][i],
                                  df_input['KeyWord2'][i],
                                  df_input['KeyWord3'][i]),
                       ignore_index=True)
    except KeyError:  # except clause missing in the original snippet; skip absent rows
        pass

# The apply attempt: returns a Series with one DataFrame per row
pd.DataFrame(df_input.apply(lambda row: preprocess(row['DocumentText'],
                                                   row.name,
                                                   row['KeyWord1'],
                                                   row['KeyWord2'],
                                                   row['KeyWord3']),
                            axis=1))[0]
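For completeness, the per-row DataFrames could be stacked with pd.concat; this is only a sketch of that missing step, not a performance fix, since preprocess still builds one DataFrame per document:

frames = df_input.apply(lambda row: preprocess(row['DocumentText'],
                                               row.name + 1,  # +1 so DocNr starts at 1, like the loop
                                               row['KeyWord1'],
                                               row['KeyWord2'],
                                               row['KeyWord3']),
                        axis=1)
df = pd.concat(frames.tolist(), ignore_index=True)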
Is there any other way to get to this result quickly?
This should be fast:
import numpy as np

# Split on whitespace (raw string for the regex) and explode to one token per row
e = df_input.assign(DocumentText=df_input['DocumentText'].str.split(r'\s+')).explode('DocumentText')
keywords = e.filter(like='KeyWord')
# For each token: 0 = no keyword match, 1..n = index of the matching keyword column
col_idxes = np.sum((e['DocumentText'].to_numpy()[:, None] == keywords.to_numpy())
                   * np.arange(1, keywords.shape[1] + 1), axis=1)
tags = np.array(['O', *keywords.columns])[col_idxes]
out = e[['DocumentText']].assign(DocNr=e.index + 1, Tag=tags).reset_index(drop=True)
Output:
>>> out
  DocumentText  DocNr       Tag
0         This      1         O
1           is      1         O
2            a      1         O
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         O
9           xx      2  KeyWord3
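To see why the indexing works, here is a toy illustration of the broadcasting step on sample values (array names invented for the demo):

import numpy as np

tokens   = np.array(['This', 'doc', '12'])
keywords = np.array([['doc', '12', 'ab']] * 3)      # each token's own row of keyword values
matches  = tokens[:, None] == keywords              # boolean match matrix, shape (3, 3)
idx      = (matches * np.arange(1, 4)).sum(axis=1)  # 0 = no match, 1..3 = column number
print(np.array(['O', 'KeyWord1', 'KeyWord2', 'KeyWord3'])[idx])
# ['O' 'KeyWord1' 'KeyWord2']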
You can try tokenizing each entry with df.apply first and then matching the words against the keywords:
import pandas as pd
import nltk

nltk.download('punkt')

df1 = pd.DataFrame({'DocumentText': ['This is a doc 12 ab', 'document 13 a xx'],
                    'KeyWord1': ['doc', 'document'],
                    'KeyWord2': ['12', '13'],
                    'KeyWord3': ['ab', 'xx']
                    })

# Tokenize, then explode to one word per row; the 1-based index becomes DocNr
df = pd.DataFrame(data={'word': df1.DocumentText.apply(nltk.word_tokenize)})
df.index += 1
df = df.explode('word')
df = df.rename_axis('DocNr')

# Default tag 'O', then overwrite words found in a keyword column
df['Tag'] = 'O'
df.loc[df['word'].isin(df1['KeyWord1'].to_numpy()), 'Tag'] = 'KeyWord1'
df.loc[df['word'].isin(df1['KeyWord2'].to_numpy()), 'Tag'] = 'KeyWord2'
df.loc[df['word'].isin(df1['KeyWord3'].to_numpy()), 'Tag'] = 'KeyWord3'
           word       Tag
DocNr
1          This         O
1            is         O
1             a         O
1           doc  KeyWord1
1            12  KeyWord2
1            ab  KeyWord3
2      document  KeyWord1
2            13  KeyWord2
2             a         O
2            xx  KeyWord3
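One caveat: isin compares each word against the keyword values of all documents, so a word that is an ordinary token in one document but a keyword in another would still get tagged. If that matters for your data, a per-document variant (a sketch reusing df1 from above) could compare each word only against its own row's keywords:

tok = df1.assign(word=df1.DocumentText.apply(nltk.word_tokenize)).explode('word')
tok['Tag'] = 'O'
for col in ['KeyWord1', 'KeyWord2', 'KeyWord3']:
    # match the word against the keyword of its own document only
    tok.loc[tok['word'] == tok[col], 'Tag'] = col
out = tok[['word', 'Tag']].set_axis(tok.index + 1).rename_axis('DocNr')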
Use df.explode and Series.map for better performance:
# One token per row
df_input.DocumentText = df_input.DocumentText.str.split()
x = df_input.explode('DocumentText')

# Build a keyword -> column-name lookup from the unique keyword rows
y = x.iloc[:, 1:].drop_duplicates()
d = {col: y[col].values.tolist() for col in y.columns}
d = {word: col for col, words in d.items() for word in words}

# Tag via dictionary lookup; unmatched tokens get 'O'
x['tags'] = x.DocumentText.map(d).fillna('O')
x['DocNr'] = x.index + 1
res = x[['DocumentText', 'DocNr', 'tags']].reset_index(drop=True)
>>> res
  DocumentText  DocNr      tags
0         This      1         O
1           is      1         O
2            a      1         O
3          doc      1  KeyWord1
4           12      1  KeyWord2
5           ab      1  KeyWord3
6     document      2  KeyWord1
7           13      2  KeyWord2
8            a      2         O
9           xx      2  KeyWord3
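None of the answers report timings, so to compare the approaches on your real data you could use a small timeit harness. A sketch, assuming df_input still holds the original unsplit DocumentText strings and each approach is wrapped in a function:

import timeit

def explode_map(frame):
    # the df.explode / Series.map approach, kept side-effect free via assign
    x = frame.assign(DocumentText=frame.DocumentText.str.split()).explode('DocumentText')
    y = x.iloc[:, 1:].drop_duplicates()
    d = {word: col for col in y.columns for word in y[col]}
    x['tags'] = x.DocumentText.map(d).fillna('O')
    return x.reset_index(drop=True)

print(timeit.timeit(lambda: explode_map(df_input), number=100))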