How to method chain .agg() and .assign() functions in Pandas

Question

I want to replicate this dplyr query in pandas, but I am having trouble getting .agg() and .assign() to work together in a chain. Any suggestions would be appreciated.

dplyr code:

counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))

Attempt in pandas:
In the .assign() step I point the variables back at the original DataFrame, and nothing else I tried worked either.

counties.\
   groupby('state').\
   agg(total_area = ('land_area', 'sum'),
       total_population = ('population', 'sum')).\
   reset_index().\
   assign(density = counties['total_population'] / counties['total_area']).\
   arrange('density', ascending = False).\
   head()

Answer 1

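The problem is that you need a lambda so that assign works on the chained data, that is, the DataFrame already processed by the preceding methods in the chain. Inside the chain, counties still refers to the original frame, which has no total_population or total_area columns. Change: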

assign(density = counties['total_population'] / counties['total_area'])

to:

assign(density = lambda x: x['total_population'] / x['total_area'])
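To see why the lambda matters, here is a minimal sketch on a made-up counties frame (the values are invented for illustration; only the column names come from the question). It shows that the original frame has no aggregated columns, while a callable passed to assign receives the intermediate result of the chain.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question
counties = pd.DataFrame({
    "state": ["A", "A", "B"],
    "land_area": [10.0, 30.0, 20.0],
    "population": [100, 300, 1000],
})

# Intermediate result of the chain: one row per state
summary = (counties.groupby("state")
                   .agg(total_area=("land_area", "sum"),
                        total_population=("population", "sum"))
                   .reset_index())

# The original frame was never modified, so the aggregated column is missing
try:
    counties["total_population"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A callable is evaluated against the frame that assign is called on
with_density = summary.assign(
    density=lambda x: x["total_population"] / x["total_area"]
)
```

The lambda argument x is simply whatever DataFrame reaches assign at that point in the chain, so the same pattern works after any earlier step.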

A second problem is that pandas DataFrames have no arrange method, so replace:

arrange('density', ascending = False)

with the method DataFrame.sort_values:

sort_values('density', ascending = False)

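Putting it all together, with the chain wrapped in parentheses so that each method call can start a new line with a leading dot: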

df = (counties.groupby('state')
              .agg(total_area = ('land_area', 'sum'),
                   total_population = ('population', 'sum'))
              .reset_index()
              .assign(density = lambda x: x['total_population'] / x['total_area'])
              .sort_values('density', ascending = False)
              .head())
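As a quick check, the chain can be run end to end on a small invented dataset. This is only a sketch: the helper name state_density and all the values are assumptions made for illustration, not part of the question.

```python
import pandas as pd

def state_density(counties: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per state, add density, and sort from densest to sparsest."""
    return (counties.groupby("state")
                    .agg(total_area=("land_area", "sum"),
                         total_population=("population", "sum"))
                    .reset_index()
                    .assign(density=lambda x: x["total_population"] / x["total_area"])
                    .sort_values("density", ascending=False))

# Hypothetical sample data; only the column names come from the question
counties = pd.DataFrame({
    "state": ["A", "A", "B", "C"],
    "land_area": [10.0, 30.0, 20.0, 50.0],
    "population": [100, 300, 1000, 50],
})

result = state_density(counties)
print(result)
```

sort_values keeps the index labels created by reset_index, so the rows come back in density order with their pre-sort labels. Appending .reset_index(drop=True) would renumber them if that is wanted.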

Answer 2

With datar, dplyr code can be ported to Python easily, without having to learn the pandas API:

from datar.all import f, group_by, summarize, sum, mutate, arrange, desc

counties_selected >> \
  group_by(f.state) >> \
  summarize(total_area = sum(f.land_area),
            total_population = sum(f.population)) >> \
  mutate(density = f.total_population / f.total_area) >> \
  arrange(desc(f.density))

I am the author of the package. Feel free to open an issue if you have any questions.

pandas

data-manipulation

method-chaining