How do I count the total number of words in a Pandas dataframe cell and add those to a new column?

Question

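A common task in sentiment analysis is to obtain the number of words in a Pandas dataframe cell and create a new column based on that count. How do I do this?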

Answer 1

Assuming you have a dataframe df that was generated using

import pandas

df = pandas.read_csv('dataset.csv')

you can then add a new column containing the word count by doing the following:

df['new_column'] = df.columnToCount.apply(lambda x: len(str(x).split(' ')))

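The space passed to split matters, because it is the delimiter that separates one word from the next. Before counting, you may also want to remove punctuation and digits and convert the text to lowercase: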

df = df.apply(lambda x: x.astype(str).str.lower())
df = df.replace(r'\d+', '', regex=True)
df = df.replace(r'[^\w\s\+]', '', regex=True)
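Putting the steps of this answer together, here is a minimal sketch on a small in-memory dataframe. The column names `text` and `word_count` and the sample rows are made up for illustration. It also swaps in `str.split()` with no argument, which splits on any run of whitespace, so the double space left behind after removing a digit does not inflate the count.

```python
import pandas as pd

# Hypothetical sample data; the column name `text` is an assumption.
df = pd.DataFrame({"text": ["Hello, World!", "I have 2  cats", ""]})

# Lowercase, then strip digits and punctuation, as in the answer above.
cleaned = (
    df["text"]
    .astype(str)
    .str.lower()
    .str.replace(r"\d+", "", regex=True)
    .str.replace(r"[^\w\s\+]", "", regex=True)
)

# split() with no argument splits on whitespace runs and drops empty pieces.
df["word_count"] = cleaned.str.split().str.len()
```

With this variation an empty cell ends up with a count of 0 rather than 1.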

Answer 2

from collections import Counter

# Per-cell word frequencies as (word, count) pairs; assumes each cell holds a string.
df['new_column'] = df['count_column'].apply(lambda x: Counter(str(x).split(" ")).items())
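Note that a Counter gives per-word frequencies rather than a single number. If the total is what you want, one way to get it is to sum the Counter's values. A rough sketch, assuming each cell holds a plain string (the sample sentences are made up):

```python
from collections import Counter

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"count_column": ["the cat and the hat", "one"]})

# One Counter per cell, mapping each word to how often it occurs.
freqs = df["count_column"].apply(lambda x: Counter(str(x).split()))

# The total word count of a cell is the sum of its per-word frequencies.
df["new_column"] = freqs.apply(lambda c: sum(c.values()))
```

Keeping `freqs` around is only worthwhile if you also need the per-word breakdown; for a bare total, the other answers are simpler.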

Answer 3

Assuming that a sentence with n words contains n-1 spaces, there is another solution:

df['new_column'] = df['count_column'].str.count(' ') + 1

This solution is probably faster, because it does not split each string into a list.

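If count_column contains empty strings, the result needs to be adjusted, because an empty string would otherwise be counted as one word:

import numpy as np
df['new_column'] = np.where(df['count_column'] == '', 0, df['new_column'])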
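The n-1 spaces assumption only holds for text with exactly one space between words and no padding. A small comparison against a whitespace split shows where the two approaches diverge (the sample strings are invented for illustration):

```python
import pandas as pd

# Hypothetical cells: clean text, a double space, padding, and an empty string.
df = pd.DataFrame(
    {"count_column": ["one two three", "two  spaces", " padded ", ""]}
)

# Space counting: every space adds one, whether or not a word follows it.
by_spaces = df["count_column"].str.count(" ") + 1

# Whitespace split: runs of spaces and leading/trailing padding are ignored.
by_split = df["count_column"].str.split().str.len()
```

Here `by_spaces` reports 3 for every non-empty row and 1 for the empty one, while `by_split` gives 3, 2, 1 and 0. If your data may contain irregular spacing, normalizing it first or using the split version is probably the safer choice.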

Answer 4

For a dataframe df, remove punctuation from the selected column:

import string

# Delete every ASCII punctuation character from the reviews column.
df['reviews'] = df['reviews'].str.translate(str.maketrans('', '', string.punctuation))

Get the word count:

from nltk.tokenize import word_tokenize

# Tokenize each review, then store the number of tokens.
df['review_word_count'] = df['reviews'].apply(word_tokenize).apply(len)
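If you would rather not depend on NLTK and its tokenizer data, a rough stand-in is to split on whitespace once the punctuation is gone. This is a sketch under that assumption, not a replacement for word_tokenize, which handles contractions and other cases differently; the sample reviews are made up.

```python
import string

import pandas as pd

# Hypothetical sample reviews for illustration.
df = pd.DataFrame({"reviews": ["Great value, works well!", "Meh..."]})

# Translation table that maps every ASCII punctuation character to None.
table = str.maketrans("", "", string.punctuation)
df["reviews"] = df["reviews"].str.translate(table)

# Whitespace split as a simple substitute for a real tokenizer.
df["review_word_count"] = df["reviews"].str.split().str.len()
```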

Write a CSV that includes the new column:

df.to_csv('./data/dataset.csv')
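One thing to be aware of: to_csv writes the row index as an extra column by default, and that column comes back as `Unnamed: 0` when the file is read again. Passing `index=False` avoids this. A small round-trip sketch, using an in-memory buffer in place of a real file path (the sample data is invented):

```python
import io

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"reviews": ["great product", "not bad at all"]})
df["review_word_count"] = df["reviews"].str.split().str.len()

# StringIO stands in for a path such as './data/dataset.csv'.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False: do not write the row index

# Read the CSV back to confirm only the two real columns were written.
buffer.seek(0)
restored = pd.read_csv(buffer)
```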
