Counting occurrences of a POS tag pattern

Question

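I have applied POS tagging to one of the columns in my dataframe. For each sentence, I want to count how many times this pattern occurs: NNP, MD, VB.

For example, take the following sentence: communications between the Principal and the Contractor shall be in the English language

The POS tags would be: (communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN).

Note that the pattern (NNP, MD, VB) is present in the tagging result and occurs once. I would like to create a new column in the df holding this count.

Is there a way to do this?

Thanks in advance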

Answer 1

A simple counter function will do what you want!

Input:

import pandas as pd

df = pd.DataFrame({'POS': [
    '(communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)',
    '(Contractor, NNP), (shall, MD), (be,VB), (communications, NNS), (between,IN), (the, DT), (Principal, NNP), (and, CC), (the, DT), (Contractor, NNP), (shall, MD), (be,VB), (in, DT), (the, DT), (English, JJ), (language, NN)',
    '(and, CC), (the, DT)',
]})

Function:

def counter(pos):
    # Pull the tag out of each "(word, tag)" pair in the string.
    tags = []
    for item in pos.split('), ('):
        temp = item.strip(' )(')
        tags.append(temp.split(',')[-1].strip())
    # Slide a window of three tags over the sentence and count exact matches.
    # For fewer than three tags the range is empty, so the result is 0.
    pattern = ['NNP', 'MD', 'VB']
    count = 0
    for idx in range(len(tags) - 2):
        if tags[idx:idx+3] == pattern:
            count += 1
    return count

Output:

df['occ'] = df['POS'].apply(counter)
df

    POS     occ
0   (communications, NNS), (between,IN), (the, DT)...   1
1   (Contractor, NNP), (shall, MD), (be,VB), (comm...   2
2   (and, CC), (the, DT)    0
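
If the column holds the tagger's output as a list of (word, tag) tuples rather than one long string (this is the shape nltk.pos_tag returns), the string parsing can be skipped. The following is a minimal sketch under that assumption; the name count_pattern and its pattern parameter are illustrative and not part of the answer above, and the sample sentence is written out by hand with the tags from the question.

```python
def count_pattern(tagged, pattern=("NNP", "MD", "VB")):
    """Count consecutive (possibly overlapping) occurrences of `pattern` in the tags."""
    tags = [tag for _word, tag in tagged]
    target = list(pattern)
    n = len(target)
    # Compare every window of n consecutive tags against the pattern.
    return sum(tags[i:i + n] == target for i in range(len(tags) - n + 1))


# Hand-written (word, tag) pairs; the tags are copied from the question.
sentence = [
    ("communications", "NNS"), ("between", "IN"), ("the", "DT"),
    ("Principal", "NNP"), ("and", "CC"), ("the", "DT"),
    ("Contractor", "NNP"), ("shall", "MD"), ("be", "VB"),
    ("in", "DT"), ("the", "DT"), ("English", "JJ"), ("language", "NN"),
]

print(count_pattern(sentence))
```

With a dataframe whose column stores such lists (say a hypothetical column named tagged), the same function can be passed to apply, as in df['tagged'].apply(count_pattern). Making the pattern a parameter means other tag sequences can be counted without editing the function.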

python

nlp

pos-tagger

dataframe