Apply Same Aggregation on Multiple Columns when Using Groupby (python)
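When I want to apply the same function to several columns, I have to write out each column name and map it to that function one by one. This gets tedious when there are many columns. In the code below, I map 3 columns to the same function ("first").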
import pandas as pd
user_id = [12, 12, 13, 13, 13]
category = ["furniture", "furniture", "electronics","electronics","electronics"]
name = ["Casey", "Casey", "Alice", "Alice", "Alice"]
payment_amount = [96, 109, 56, 0, 90]
example_df = pd.DataFrame({"user_id" : user_id, "category" : category, "name" : name, "payment_amount": payment_amount})
expected_output = example_df.groupby("user_id").agg({"user_id" : "first", "category" : "first", "name" : "first", "payment_amount": sum})
Instead, I would like to write something like this and get the same output:
expected_output = example_df.groupby("user_id").agg({["user_id" , "category" , "name"]: "first", "payment_amount": sum})
But this raises an error. How can I achieve this?
You can generate the dict:
d = {**{"payment_amount": 'sum'},
**dict.fromkeys(["user_id" , "category" , "name"], 'first')}
print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}
expected_output = example_df.groupby("user_id").agg(d)
A more general solution would be:
d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}
expected_output = example_df.groupby("user_id").agg(d)
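The syntax attempted in the question fails because a list is unhashable and so cannot be a dict key; a tuple can. As a minimal sketch, a small helper can accept tuple keys and expand them into the flat dict that agg expects. The name expand_agg_spec is hypothetical and is not part of pandas.

```python
import pandas as pd

def expand_agg_spec(grouped_spec):
    """Flatten {column or tuple of columns: func} into {column: func}.

    Hypothetical helper for illustration; not a pandas API.
    """
    flat = {}
    for cols, func in grouped_spec.items():
        # A bare string is a single column name, not a sequence of names
        if isinstance(cols, str):
            cols = (cols,)
        for col in cols:
            flat[col] = func
    return flat

example_df = pd.DataFrame({
    "user_id": [12, 12, 13, 13, 13],
    "category": ["furniture", "furniture", "electronics", "electronics", "electronics"],
    "name": ["Casey", "Casey", "Alice", "Alice", "Alice"],
    "payment_amount": [96, 109, 56, 0, 90],
})

# Tuple keys are hashable, so this dict literal is valid
spec = expand_agg_spec({("user_id", "category", "name"): "first",
                        "payment_amount": "sum"})
result = example_df.groupby("user_id").agg(spec)
print(result)
```

This keeps the call site close to what the question asked for, at the cost of one extra function.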
You can use a dictionary comprehension with an explicit list of columns to aggregate with mean, falling back to first by default:
expected_output = (
    example_df.groupby('user_id')
    # the trailing comma makes this a tuple; without it, `in` does a substring test on a plain string
    .agg({c: 'mean' if c in ('payment_amount',) else 'first'
          for c in example_df})
)
Output:
         user_id     category   name  payment_amount
user_id
12            12    furniture  Casey      102.500000
13            13  electronics  Alice       48.666667
However, I would also suggest using the dtypes to select the columns:
expected_output = (
    example_df.groupby('user_id')
    # dtypes[1:] skips the first column (user_id), which is the grouping key
    .agg({k: 'mean' if v in ('int64', 'float64') else 'first'
          for k, v in example_df.dtypes[1:].items()})
)
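Comparing dtypes against the strings 'int64' and 'float64' misses other numeric types such as int32 or nullable integers. A possible variant, sketched here under the assumption that every numeric column should be averaged, uses pandas.api.types.is_numeric_dtype and drops the grouping key by name rather than by position. Note that is_numeric_dtype also treats boolean columns as numeric.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

example_df = pd.DataFrame({
    "user_id": [12, 12, 13, 13, 13],
    "category": ["furniture", "furniture", "electronics", "electronics", "electronics"],
    "name": ["Casey", "Casey", "Alice", "Alice", "Alice"],
    "payment_amount": [96, 109, 56, 0, 90],
})

# Exclude the grouping key by name instead of relying on column order
value_cols = example_df.columns.drop("user_id")

# Numeric columns get 'mean'; everything else falls back to 'first'
agg_map = {
    col: "mean" if is_numeric_dtype(example_df[col]) else "first"
    for col in value_cols
}

result = example_df.groupby("user_id").agg(agg_map)
print(result)
```

Because user_id is left out of the mapping, it appears only as the index of the result, not as a repeated column.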