Count the occurrences of a POS tagging pattern
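I have applied POS tagging to one of the columns in my dataframe. For each sentence, I want to count the occurrences of this pattern: NNP, MD, VB.
For example, I have the following sentence:
Communications between the Principal and the Contractor shall be in the English language
The POS tagging would be:
(Communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN).
Note that in the tagging result the pattern (NNP, MD, VB) is present and occurs 1 time. I would like to create a new column in the df holding this occurrence count.
Is there a way to do this?
Thanks in advance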
A simple counter function will do what you want!
Input:
import pandas as pd

df = pd.DataFrame({'POS': [
    '(communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)',
    '(Contractor, NNP), (shall, MD), (be,VB), (communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)',
    '(and, CC), (the, DT)',
]})
Function:
def counter(pos):
    # Split the string into "(word, tag)" items and keep only the tags
    tags = []
    for item in pos.split('), ('):
        temp = item.strip(' )(')
        tags.append(temp.split(',')[-1].strip())
    # Slide a window of three tags over the sentence and count exact matches
    pattern = ['NNP', 'MD', 'VB']
    count = 0
    for idx in range(len(tags) - 2):
        if tags[idx:idx+3] == pattern:
            count += 1
    return count
Output:
df['occ'] = df['POS'].apply(counter)
df
                                                 POS  occ
0  (communications, NNS), (between,IN), (the, DT)...    1
1  (Contractor, NNP), (shall, MD), (be,VB), (comm...    2
2                               (and, CC), (the, DT)    0
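If the column holds the tagger output as a list of (word, tag) tuples rather than a string, the same sliding-window idea might look like the sketch below. The name count_tag_pattern and its pattern parameter are made up for illustration, and the list-of-tuples input is an assumption; the question does not say how the tags are stored.

```python
def count_tag_pattern(tagged, pattern=("NNP", "MD", "VB")):
    """Count contiguous occurrences of `pattern` in a list of (word, tag) pairs.

    Hypothetical helper: assumes each sentence is already a list of tuples.
    """
    tags = [tag for _, tag in tagged]
    target = list(pattern)
    n = len(target)
    # range() is empty when the sentence is shorter than the pattern,
    # so short inputs yield 0 without a separate length check.
    return sum(tags[i:i + n] == target for i in range(len(tags) - n + 1))


example = [
    ("Contractor", "NNP"), ("shall", "MD"), ("be", "VB"),
    ("in", "DT"), ("the", "DT"), ("English", "JJ"), ("language", "NN"),
]
occurrences = count_tag_pattern(example)
```

With a dataframe column of such lists (say a hypothetical column named 'tagged'), this could be applied as df['tagged'].apply(count_tag_pattern), and a different pattern passed through the pattern argument.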