How to go through each row with pandas apply() and lambda to clean sentence tokens?
My goal is to create a clean column of tokenized sentences in an existing dataframe. The dataset is a pandas dataframe that looks like this:
Index | Tokenized_sents
---|---
First | [Donald, Trump, just, couldn, t, wish, all, Am]
Second | [On, Friday, ,, it, was, revealed, that]
dataset['cleaned_sents'] = dataset.apply(lambda row: [w for w in row["tokenized_sents"] if len(w)>2 and w.lower() not in stop_words], axis = 1)
My current output is the dataframe without that extra column.
Current output:
tokenized_sents \
0 [Donald, Trump, just, couldn, t, wish, all, Am...
Desired output:
tokenized_sents \
0 [Donald, Trump, just, couldn, wish, all...
Basically, remove all the stop words and short words.
Create a sentence index:
dataset['gid'] = range(1, dataset.shape[0] + 1)
tokenized_sents gid
0 [This, is, a, test] 1
1 [and, this, too!] 2
Then explode the dataframe:
clean_df = dataset.explode('tokenized_sents')
tokenized_sents gid
0 This 1
0 is 1
0 a 1
0 test 1
1 and 2
1 this 2
1 too! 2
Do all the cleaning on this dataframe, then group the tokens back together using the gid column. This will be the fastest approach.
clean_df = clean_df[clean_df.tokenized_sents.str.len() >= 2]
.
.
.
To get the lists back:
clean_dataset = clean_df.groupby('gid').agg(list)
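The steps above can be put together into one runnable sketch. A toy stop-word set stands in for `stop_words` (in practice it would come from e.g. NLTK), and the length filter uses `> 2` to match the question's goal rather than the `>= 2` shown above:

```python
import pandas as pd

# Toy stand-in for a real stop-word set (e.g. nltk.corpus.stopwords)
stop_words = {"is", "a", "and", "this", "too"}

dataset = pd.DataFrame({"tokenized_sents": [["This", "is", "a", "test"],
                                            ["and", "this", "too!"]]})

# Sentence index so rows can be regrouped after exploding
dataset["gid"] = range(1, dataset.shape[0] + 1)

# One token per row
clean_df = dataset.explode("tokenized_sents")

# Vectorized cleaning: drop short tokens, then drop stop words
clean_df = clean_df[clean_df.tokenized_sents.str.len() > 2]
clean_df = clean_df[~clean_df.tokenized_sents.str.lower().isin(stop_words)]

# Regroup the surviving tokens back into one list per sentence
clean_dataset = clean_df.groupby("gid").agg(list)
print(clean_dataset)
```

Because `str.len()`, `str.lower()`, and `isin()` operate on whole columns at once, the cleaning runs vectorized instead of invoking a Python lambda per row.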
Fixing your code:
dataset['new'] = dataset['tokenized_sents'].\
    map(lambda x: [t for t in x if len(t) > 2 and t.lower() not in stop_words])
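For reference, a self-contained sketch of this one-liner fix on the question's first row, again with a toy stop-word set standing in for the real `stop_words`:

```python
import pandas as pd

# Toy stand-in for a real stop-word set
stop_words = {"just", "all", "it", "was", "that", "on"}

dataset = pd.DataFrame({"tokenized_sents": [
    ["Donald", "Trump", "just", "couldn", "t", "wish", "all", "Am"],
]})

# map() on a single column avoids a row-wise apply(axis=1) over the whole frame
dataset["cleaned_sents"] = dataset["tokenized_sents"].map(
    lambda toks: [t for t in toks if len(t) > 2 and t.lower() not in stop_words]
)
print(dataset["cleaned_sents"][0])
```

Here "t" and "Am" are dropped by the length filter, and "just"/"all" by the stop-word check, leaving `['Donald', 'Trump', 'couldn', 'wish']`.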