How to solve Python Pandas assign error when creating new column

Question

I have a dataframe containing home descriptions:

description
0   Beautiful, spacious skylit studio in the heart...
1   Enjoy 500 s.f. top floor in 1899 brownstone, w...
2   The spaceHELLO EVERYONE AND THANKS FOR VISITIN...
3   We welcome you to stay in our lovely 2 br dupl...
4   Please don’t expect the luxury here just a bas...
5   Our best guests are seeking a safe, clean, spa...
6   Beautiful house, gorgeous garden, patio, cozy ...
7   Comfortable studio apartment with super comfor...
8   A charming month-to-month home away from home ...
9   Beautiful peaceful healthy homeThe spaceHome i...

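I am trying to count the sentences in each row (using sent_tokenize from nltk.tokenize) and append those values to df as a new column, sentence_count. Since this is part of a larger data pipeline, I am using pandas assign so that I can chain operations.

However, I can't seem to get it to work. I have tried: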

df.assign(sentence_count=lambda x: len(sent_tokenize(x['description'])))

and

df.assign(sentence_count=len(sent_tokenize(df['description'])))

but both raise the following error:

TypeError: expected string or bytes-like object

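I have confirmed that the value in every row is a str. Could it be because description has dtype('O')?

What am I doing wrong here? Using pipe with a custom function works fine, but I would prefer to use assign.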

Answer 1

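In your first example, x['description'] is a pandas.Series by the time you pass it to sent_tokenize. That is not a string. It is a Series of strings (similar to a list), and sent_tokenize expects a single string. The second example has the same problem, because df['description'] is also a Series.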
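The failure can be reproduced on a small frame. This is a minimal sketch, assuming a regex-based stand-in called simple_sent_tokenize (a hypothetical helper, not part of nltk) so that it runs without NLTK or its tokenizer data. The toy descriptions are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical stand-in for nltk's sent_tokenize (an assumption for this sketch):
# split after '.', '!' or '?' when followed by whitespace.
def simple_sent_tokenize(text):
    return re.split(r"(?<=[.!?])\s+", text)

# Made-up sample data, not the asker's dataframe.
df = pd.DataFrame({"description": ["Nice flat. Very clean.", "Cozy studio."]})

# Selecting a column gives the whole column at once, not one string.
column = df["description"]
print(type(column).__name__)  # Series

# A regex-based tokenizer needs a str, so passing the Series raises TypeError.
try:
    simple_sent_tokenize(column)
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
```

The callable given to assign receives the whole DataFrame once, so the tokenizer has to be applied per element rather than to the column as a whole.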

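So you should apply sent_tokenize to each element, then take the length of each resulting list to get the count:

df.assign(sentence_count=df['description'].apply(sent_tokenize).apply(len))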

Or, if you need to pass extra arguments to sent_tokenize, wrap the call in a lambda:

df.assign(sentence_count=df['description'].apply(lambda x: len(sent_tokenize(x))))
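Because the question is about chaining, it may help to give assign a callable, so the column is read from the frame at that point in the chain rather than from the outer df. The following is a sketch under the same assumption as before: simple_sent_tokenize is a hypothetical regex stand-in for nltk's sent_tokenize, the data is made up, and the dropna step is only there to show a preceding link in the chain.

```python
import re
import pandas as pd

# Hypothetical stand-in for nltk's sent_tokenize (an assumption for this sketch).
def simple_sent_tokenize(text):
    return re.split(r"(?<=[.!?])\s+", text)

# Made-up sample data.
df = pd.DataFrame(
    {"description": ["Nice flat. Very clean. Great view!", "Cozy studio."]}
)

result = (
    df
    .dropna(subset=["description"])  # illustrative earlier step in the pipeline
    .assign(
        # d is the DataFrame as it exists at this step of the chain;
        # the inner lambda runs once per string and returns its sentence count.
        sentence_count=lambda d: d["description"].apply(
            lambda s: len(simple_sent_tokenize(s))
        )
    )
)

print(result["sentence_count"].tolist())  # [3, 1]
```

With NLTK installed, swapping simple_sent_tokenize for sent_tokenize should give the same structure. assign returns a new frame, so df itself is left without the sentence_count column.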
