How to apply pos_tag_sents() to a pandas DataFrame efficiently

When you want to POS-tag a column of text stored in a pandas DataFrame, with one sentence per row, most implementations on SO use the apply method:

dfData['POSTags'] = dfData['SourceText'].apply(
                 lambda row: pos_tag(word_tokenize(row)))

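The NLTK documentation recommends using pos_tag_sents() for efficient tagging of more than one sentence.

Does that apply to this example? If so, would the code be as simple as changing pos_tag to pos_tag_sents, or does NLTK mean text sources made up of paragraphs?

As mentioned in the comments, pos_tag_sents() aims to avoid loading the perceptron every time, but the question is how to do that and still produce a column in a pandas DataFrame.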

Link to Sample Dataset 20kRows

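By applying pos_tag to each row, the Perceptron model is loaded every time (an expensive operation, since it reads a pickle from disk).

If you instead take all the rows and send them to pos_tag_sents (which expects a list(list(str))), the model is loaded once and used for all rows.

See the source.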
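To make the cost difference concrete, here is a toy simulation of the two call patterns. It does not use NLTK at all: ToyTagger, toy_pos_tag and toy_pos_tag_sents are hypothetical stand-ins that only count how often the "model" is constructed, assuming the per-call loading behaviour described above.

```python
class ToyTagger:
    """Stand-in for an expensive model; counts how often it is 'loaded'."""
    loads = 0

    def __init__(self):
        ToyTagger.loads += 1  # pretend this reads a pickle from disk

    def tag(self, tokens):
        # Dummy tagging: label every token 'X'
        return [(tok, "X") for tok in tokens]


def toy_pos_tag(tokens):
    """Per-sentence entry point: builds a new tagger on every call."""
    return ToyTagger().tag(tokens)


def toy_pos_tag_sents(sentences):
    """Batch entry point: builds one tagger and reuses it for all sentences."""
    tagger = ToyTagger()
    return [tagger.tag(sent) for sent in sentences]


sentences = [["active", "married"], ["healthy", "single"], ["cozily", "single"]]

ToyTagger.loads = 0
per_row = [toy_pos_tag(s) for s in sentences]
loads_per_row = ToyTagger.loads   # one load per sentence

ToyTagger.loads = 0
batched = toy_pos_tag_sents(sentences)
loads_batched = ToyTagger.loads   # a single load in total

print(loads_per_row, loads_batched)
```

Both patterns return the same tags; only the number of model loads differs, which is where the time goes on a 20k-row column.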

Assign this to your new column instead:

dfData['POSTags'] = pos_tag_sents(dfData['SourceText'].apply(word_tokenize).tolist())
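If the column may contain missing values, or the DataFrame does not have a default index, the one-liner can be wrapped in a small helper. This is only a sketch: add_pos_column is a hypothetical name, and the tokenizer and batch tagger are passed in as arguments so the demo below can run with simple stand-ins (str.split and a fake tagger) instead of NLTK.

```python
import pandas as pd


def add_pos_column(df, text_col, out_col, tokenize, tag_sents):
    """Return a copy of df where out_col holds one tagged sentence per row."""
    # Replace missing values with "" so the tokenizer always receives a str
    texts = df[text_col].fillna("").astype(str)
    tokenized = [tokenize(text) for text in texts]
    tagged = list(tag_sents(tokenized))  # one batch call for all rows
    if len(tagged) != len(df):
        raise ValueError("tagger returned a different number of sentences")
    out = df.copy()
    # Reuse the original index so each row keeps its own tags
    out[out_col] = pd.Series(tagged, index=df.index, dtype=object)
    return out


def fake_tag_sents(sentences):
    """Stand-in for a batch tagger: labels every token 'X'."""
    return [[(tok, "X") for tok in sent] for sent in sentences]


demo = pd.DataFrame({"SourceText": ["active married", None]}, index=[10, 20])
result = add_pos_column(demo, "SourceText", "POSTags", str.split, fake_tag_sents)
print(result)
```

With NLTK installed and its tokenizer and tagger data available, the call would presumably be add_pos_column(dfData, 'SourceText', 'POSTags', word_tokenize, pos_tag_sents).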

Input

$ cat test.csv 
ID,Task,label,Text
1,Collect Information,no response,cozily married practical athletics Mr. Brown flat
2,New Credit,no response,active married expensive soccer Mr. Chang flat
3,Collect Information,response,healthy single expensive badminton Mrs. Green flat
4,Collect Information,response,cozily married practical soccer Mr. Brown hierachical
5,Collect Information,response,cozily single practical badminton Mr. Brown flat

TL;DR

>>> from nltk import word_tokenize, pos_tag, pos_tag_sents
>>> import pandas as pd
>>> df = pd.read_csv('test.csv', sep=',')
>>> df['Text']
0    cozily married practical athletics Mr. Brown flat
1       active married expensive soccer Mr. Chang flat
2    healthy single expensive badminton Mrs. Green ...
3    cozily married practical soccer Mr. Brown hier...
4     cozily single practical badminton Mr. Brown flat
Name: Text, dtype: object
>>> texts = df['Text'].tolist()
>>> tagged_texts = pos_tag_sents(map(word_tokenize, texts))
>>> tagged_texts
[[('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('athletics', 'NNS'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')], [('active', 'JJ'), ('married', 'VBD'), ('expensive', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Chang', 'NNP'), ('flat', 'JJ')], [('healthy', 'JJ'), ('single', 'JJ'), ('expensive', 'JJ'), ('badminton', 'NN'), ('Mrs.', 'NNP'), ('Green', 'NNP'), ('flat', 'JJ')], [('cozily', 'RB'), ('married', 'JJ'), ('practical', 'JJ'), ('soccer', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('hierachical', 'JJ')], [('cozily', 'RB'), ('single', 'JJ'), ('practical', 'JJ'), ('badminton', 'NN'), ('Mr.', 'NNP'), ('Brown', 'NNP'), ('flat', 'JJ')]]

>>> df['POS'] = tagged_texts
>>> df
   ID                 Task        label  \
0   1  Collect Information  no response   
1   2           New Credit  no response   
2   3  Collect Information     response   
3   4  Collect Information     response   
4   5  Collect Information     response   

                                                Text  \
0  cozily married practical athletics Mr. Brown flat   
1     active married expensive soccer Mr. Chang flat   
2  healthy single expensive badminton Mrs. Green ...   
3  cozily married practical soccer Mr. Brown hier...   
4   cozily single practical badminton Mr. Brown flat   

                                                 POS  
0  [(cozily, RB), (married, JJ), (practical, JJ),...  
1  [(active, JJ), (married, VBD), (expensive, JJ)...  
2  [(healthy, JJ), (single, JJ), (expensive, JJ),...  
3  [(cozily, RB), (married, JJ), (practical, JJ),...  
4  [(cozily, RB), (single, JJ), (practical, JJ), ... 

In long:

First, you can extract the Text column into a list of strings:

texts = df['Text'].tolist()

Then you can apply the word_tokenize function:

map(word_tokenize, texts)

Note that @Boud's suggestion is almost the same, using df.apply:

df['Text'].apply(word_tokenize)

Then you dump the tokenized text into a list of lists of strings:

df['Text'].apply(word_tokenize).tolist()
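The two tokenizing routes give the same tokens but different kinds of objects, which matters if you want to inspect or reuse the result. The sketch below uses str.split as a stand-in for word_tokenize (an assumption made so that it runs without NLTK data) and a two-row DataFrame invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["active married expensive soccer",
                            "healthy single expensive badminton"]})

tokenize = str.split  # stand-in for word_tokenize

via_map = map(tokenize, df["Text"].tolist())     # lazy iterator in Python 3
via_apply = df["Text"].apply(tokenize).tolist()  # concrete list of lists

via_map_list = list(via_map)  # materialize so it can be printed or reused
same = via_map_list == via_apply
leftover = list(via_map)      # the iterator is exhausted, so this is empty

print(same, leftover)
```

Either form can be passed to a function that iterates over it once, but the map object cannot be consumed a second time, so the apply(...).tolist() form is the safer choice if you need the tokens again later.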

Then you can use pos_tag_sents:

pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )

Then you add the column back to the DataFrame:

df['POS'] = pos_tag_sents( df['Text'].apply(word_tokenize).tolist() )