How to go through each row with pandas apply() and lambda to clean sentence tokens?
My goal is to create a clean column of tokenized sentences in an existing dataframe. The dataset is a pandas dataframe that looks like this:
Index | Tokenized_sents
---|---
First | [Donald, Trump, just, couldn, t, wish, all, Am]
Second | [On, Friday, ,, it, was, revealed, that]
dataset['cleaned_sents'] = dataset.apply(lambda row: [w for w in row["tokenized_sents"] if len(w)>2 and w.lower() not in stop_words], axis = 1)
My current output is the dataframe without that extra column.
Current output:
tokenized_sents \
0 [Donald, Trump, just, couldn, t, wish, all, Am...
Desired output:
tokenized_sents \
0 [Donald, Trump, just, couldn, wish, all...
Basically, remove all the stop words and short words.
Create a sentence index:
dataset['gid'] = range(1, dataset.shape[0] + 1)
tokenized_sents gid
0 [This, is, a, test] 1
1 [and, this, too!] 2
Then explode the dataframe:
clean_df = dataset.explode('tokenized_sents')
tokenized_sents gid
0 This 1
0 is 1
0 a 1
0 test 1
1 and 2
1 this 2
1 too! 2
Do all the cleaning on this dataframe, then group the tokens back together using the gid column. This will be the fastest approach.
clean_df = clean_df[clean_df.tokenized_sents.str.len() >= 2]
.
.
.
To get the lists back:
clean_dataset = clean_df.groupby('gid').agg(list)
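The steps above can be put together into one runnable sketch. A toy stop-word set stands in for `stop_words` (in practice it would come from e.g. NLTK), and the length filter uses `> 2` to match the question's goal rather than the `>= 2` shown above:

```python
import pandas as pd

# Toy stand-in for a real stop-word set (e.g. nltk.corpus.stopwords)
stop_words = {"is", "a", "and", "this", "too"}

dataset = pd.DataFrame({"tokenized_sents": [["This", "is", "a", "test"],
                                            ["and", "this", "too!"]]})

# Sentence index so rows can be regrouped after exploding
dataset["gid"] = range(1, dataset.shape[0] + 1)

# One token per row
clean_df = dataset.explode("tokenized_sents")

# Vectorized cleaning: drop short tokens, then drop stop words
clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
clean_df = clean_df[~clean_df.tokenized_sents.str.lower().isin(stop_words)]

# Regroup the surviving tokens back into one list per sentence
clean_dataset = clean_df.groupby("gid").agg(list)
print(clean_dataset)
```

Because `str.len()`, `str.lower()`, and `isin()` operate on whole columns at once, the cleaning runs vectorized instead of invoking a Python lambda per row.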
Fixing your code:
dataset['new'] = dataset['tokenized_sents'].\
    map(lambda x: [t for t in x if len(t) > 2 and t.lower() not in stop_words])
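For reference, a self-contained sketch of this one-liner fix on the question's first row, again with a toy stop-word set standing in for the real `stop_words`:

```python
import pandas as pd

# Toy stand-in for a real stop-word set
stop_words = {"just", "all", "it", "was", "that", "on"}

dataset = pd.DataFrame({"tokenized_sents": [
    ["Donald", "Trump", "just", "couldn", "t", "wish", "all", "Am"],
]})

# map() on a single column avoids a row-wise apply(axis=1) over the whole frame
dataset["cleaned_sents"] = dataset["tokenized_sents"].map(
    lambda toks: [t for t in toks if len(t) > 2 and t.lower() not in stop_words]
)
print(dataset["cleaned_sents"][0])
```

Here "t" and "Am" are dropped by the length filter, and "just"/"all" by the stop-word check, leaving `['Donald', 'Trump', 'couldn', 'wish']`.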