How do you groupby and aggregate using conditional statements in Pandas?

Question

Expanding on an earlier question, I would like to know how to add condition-based aggregation to the following data:

Index    Name    Item            Quantity
0        John    Apple Red       10
1        John    Apple Green      5
2        John    Orange Cali     12
3        Jane    Apple Red       10
4        Jane    Apple Green      5
5        Jane    Orange Cali     18
6        Jane    Orange Spain     2
7        John    Banana           3
8        Jane    Coconut          5
9        John    Lime            10
... And so forth

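What I need to do is transform this data into a dataframe like the one below. Note: I am only interested in getting the total quantity of apples and oranges in separate columns, i.e. any other items that appear in a group must not be included in the aggregation on the "Quantity" column (but they should still appear as strings in the "All Items" column):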

Index    Name    All Items                                          Apples Total  Oranges Total
0        John    Apple Red, Apple Green, Orange Cali, Banana, Lime  15             12
1        Jane    Apple Red, Apple Green, Orange Cali, Coconut       15             20

How can I achieve this? Many thanks!

Answer 1

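Edit: fixed a bug.

To do this, you can create the Total columns before doing the groupby. Each one holds the row's quantity if that row's item is an apple (or an orange, respectively) and 0 otherwise.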

df['Apples Total'] = df.apply(lambda x: x.Quantity if ('Apple' in x.Item) else 0, axis=1)
df['Oranges Total'] = df.apply(lambda x: x.Quantity if ('Orange' in x.Item) else 0, axis=1)

With those columns in place, group by Name and aggregate each column: sum the Total columns and collect the Item column into a list.

df.groupby('Name').agg({'Apples Total': 'sum',
                        'Oranges Total': 'sum',
                        'Item': lambda x: list(x)
                        })
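The same idea can be sketched without a row-wise apply. This is a minimal, self-contained variant, not the answerer's code: it assumes the sample data from the question, builds the conditional columns with Series.where, and uses named aggregation to produce the column names the question asks for. The variable name result is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Jane", "Jane",
             "Jane", "Jane", "John", "Jane", "John"],
    "Item": ["Apple Red", "Apple Green", "Orange Cali", "Apple Red", "Apple Green",
             "Orange Cali", "Orange Spain", "Banana", "Coconut", "Lime"],
    "Quantity": [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Keep Quantity where the item matches the condition, otherwise use 0.
df["Apples Total"] = df["Quantity"].where(df["Item"].str.contains("Apple"), 0)
df["Oranges Total"] = df["Quantity"].where(df["Item"].str.contains("Orange"), 0)

# Named aggregation: output column = (input column, aggregation function).
# The dict is unpacked because the output names contain spaces.
result = df.groupby("Name", sort=False).agg(
    **{
        "All Items": ("Item", ", ".join),
        "Apples Total": ("Apples Total", "sum"),
        "Oranges Total": ("Oranges Total", "sum"),
    }
).reset_index()

print(result)
```

Passing sort=False keeps the groups in order of first appearance (John, then Jane), which matches the layout requested in the question.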

Answer 2

You can use groupby and pivot_table after extracting the Apple and Orange substrings, as shown below:

import re

# Extract the fruit category; rows that match neither name become NaN.
# flags=re.I is optional and makes the match case-insensitive.
s = df['Item'].str.extract("(Apple|Orange)", expand=False, flags=re.I)

# Keep only the rows that matched a category.
a = df.assign(Item_1=s).dropna(subset=['Item_1'])

out = (a.groupby("Name")['Item'].agg(",".join).to_frame().join(
        a.pivot_table(values="Quantity", index="Name", columns="Item_1", aggfunc="sum").add_suffix("_Total"))
       .reset_index())

print(out)
   Name                                            Item  Apple_Total  \
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain           15   
1  John               Apple Red,Apple Green,Orange Cali           15   

   Orange_Total  
0            20  
1            12

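Edit: for the edited question, you can use the same code, except that the groupby is done on the original dataframe df rather than on the subset a, and the result is then joined: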

out = (df.groupby("Name")['Item'].agg(",".join).to_frame().join(
        a.pivot_table(values="Quantity", index="Name", columns="Item_1", aggfunc="sum").add_suffix("_Total"))
       .reset_index())
print(out)

   Name                                               Item  Apple_Total  \
0  Jane  Apple Red,Apple Green,Orange Cali,Orange Spain...           15   
1  John      Apple Red,Apple Green,Orange Cali,Banana,Lime           15   

   Orange_Total  
0            20  
1            12
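To make the extraction step easier to follow, here is a small sketch of what str.extract returns on its own. The three sample values are hypothetical and chosen to show one subtlety: with re.I the captured text keeps its original casing, so "orange" and "Orange" would end up as separate pivot columns unless they are normalized first. Using str.capitalize for that is an assumption of this sketch, not part of the answer.

```python
import re
import pandas as pd

# Hypothetical sample values, including a lower-case spelling.
items = pd.Series(["Apple Red", "orange Cali", "Banana"])

# expand=False returns a Series; rows without a match become NaN.
kind = items.str.extract("(Apple|Orange)", expand=False, flags=re.I)
print(kind.tolist())

# The capture preserves the source casing, so normalize it before pivoting.
normalized = kind.str.capitalize()
print(normalized.tolist())
```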

Answer 3

First, use str.contains on the Item column to filter only the required rows

from io import StringIO
import pandas as pd

s = StringIO("""Name;Item;Quantity
John;Apple Red;10
John;Apple Green;5
John;Orange Cali;12
Jane;Apple Red;10
Jane;Apple Green;5
Jane;Orange Cali;18
Jane;Orange Spain;2
John;Banana;3
Jane;Coconut;5
John;Lime;10
""")

df = pd.read_csv(s,sep=';')

req_items_idx = df[df.Item.str.contains('Apple|Orange')].index

df_filtered = df.loc[req_items_idx,:]

Once you have them, you can pivot the data to get the required values for each Name

pivot_df = pd.pivot_table(df_filtered,index=['Name'],columns=['Item'],aggfunc='sum')
pivot_df.columns = pivot_df.columns.droplevel()
pivot_df.columns.name = None
pivot_df = pivot_df.reset_index()

Generate the totals for apples and oranges

orange_columns = pivot_df.columns[pivot_df.columns.str.contains('Orange')].tolist()
apple_columns = pivot_df.columns[pivot_df.columns.str.contains('Apple')].tolist()

pivot_df['Apples Total'] = pivot_df.loc[:,apple_columns].sum(axis=1)
pivot_df['Orange Total'] = pivot_df.loc[:,orange_columns].sum(axis=1)

A wrapper function to combine the items

def combine_items(inp, columns):
    # Collect the names of the columns that hold a non-null value in this row.
    res = []
    for val, col in zip(inp.values, columns):
        if not pd.isnull(val):
            res.append(col)
    return ','.join(res)

req_columns = apple_columns + orange_columns
pivot_df['Items'] = pivot_df[req_columns].apply(combine_items, args=(req_columns,), axis=1)

Finally, you can gather the required columns in one place and print the values

total_columns = pivot_df.columns[pivot_df.columns.str.contains('Total')].tolist()
name_item_columns = pivot_df.columns[pivot_df.columns.str.contains('Name|Items')].tolist()

>>> pivot_df[name_item_columns+total_columns]
   Name                                           Items  Apples Total  Orange Total
0  Jane  Apple Green,Apple Red,Orange Cali,Orange Spain          15.0          20.0
1  John               Apple Green,Apple Red,Orange Cali          15.0          12.0

This answer is intended to outline the individual steps and approaches for solving similar problems
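For comparison, the filter-then-pivot steps can also be sketched more compactly with pd.crosstab. This is an alternative under stated assumptions, not this answer's method: it assumes the sample data from the question and that the fruit category is always the first word of Item. The names kind, totals, and result are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Name": ["John", "John", "John", "Jane", "Jane",
             "Jane", "Jane", "John", "Jane", "John"],
    "Item": ["Apple Red", "Apple Green", "Orange Cali", "Apple Red", "Apple Green",
             "Orange Cali", "Orange Spain", "Banana", "Coconut", "Lime"],
    "Quantity": [10, 5, 12, 10, 5, 18, 2, 3, 5, 10],
})

# Keep only the apple and orange rows.
filtered = df[df["Item"].str.contains("Apple|Orange")]

# Assumption: the category is the first word of the item name.
kind = filtered["Item"].str.split().str[0]

# Sum Quantity per (Name, category) pair; categories become columns.
totals = pd.crosstab(filtered["Name"], kind,
                     values=filtered["Quantity"], aggfunc="sum")

# Item list built from the unfiltered data, then joined on Name.
all_items = df.groupby("Name")["Item"].agg(", ".join)
result = all_items.to_frame("All Items").join(totals)

print(result)
```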

Answer 4

from io import StringIO

import numpy as np
import pandas as pd

df = pd.read_csv(StringIO("""
Index,Name,Item,Quantity
0,John,Apple Red,10
1,John,Apple Green,5
2,John,Orange Cali,12
3,Jane,Apple Red,10
4,Jane,Apple Green,5
5,Jane,Orange Cali,18
6,Jane,Orange Spain,2
7,John,Banana,3
8,Jane,Coconut,5
9,John,Lime,10
"""))

Getting the list of items

Group by name to get the list of items

items_list = pd.DataFrame(df.groupby(["Name"])["Item"].apply(list)).rename(columns={"Item": "All Items"})
items_list
        All Items
Name    
Jane    [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]
John    [Apple Red, Apple Green, Orange Cali, Banana, Lime]

Getting the counts per name-item group

Rename the values in the Item column of the temp df so that all apples/oranges are treated alike

temp2 = df.groupby(["Name", "Item"])['Quantity'].sum()
temp2 = pd.DataFrame(temp2).reset_index().set_index("Name")
# \1 keeps only the captured category name (Apple or Orange); other items are left unchanged.
temp2['Item'] = temp2['Item'].str.replace(r'(?:.*)(apple|orange)(?:.*)', r'\1', case=False, regex=True)
temp2
    Item    Quantity
Name        
Jane    Apple   5
Jane    Apple   10
Jane    Coconut 5
Jane    Orange  18
Jane    Orange  2
John    Apple   5
John    Apple   10
John    Banana  3
John    Lime    10
John    Orange  12
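The collapsed names shown above (Apple, Orange) depend on the replacement string being the backreference \1, which keeps only the captured group. A minimal sketch of that behavior, using three hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample values: two that match and one that does not.
items = pd.Series(["Apple Red", "Orange Spain", "Coconut"])

# The pattern matches the whole string; \1 replaces it with the captured
# category only. Strings without a match are returned unchanged.
collapsed = items.str.replace(r".*(apple|orange).*", r"\1", case=False, regex=True)

print(collapsed.tolist())
```

An empty replacement string would instead erase every matching item name, so the later pivot would have no Apple or Orange columns to select.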

Getting the required pivot table

The pivot table turns the item counts into separate columns; only the apple and orange counts are kept

pivot_df = pd.pivot_table(temp2, values='Quantity', columns='Item', index=["Name"], aggfunc=np.sum)
pivot_df = pivot_df[['Apple', 'Orange']]
pivot_df
Item    Apple   Orange
Name        
Jane    15.0    20.0
John    15.0    12.0

Merging the item list df and pivot_df

output = items_list.merge(pivot_df, on="Name").rename(columns={'Apple': 'Apples Total', 'Orange': 'Oranges Total'})
output
All Items   Apples Total    Oranges Total
Name            
Jane    [Apple Red, Apple Green, Orange Cali, Orange Spain, Coconut]    15.0    20.0
John    [Apple Red, Apple Green, Orange Cali, Banana, Lime] 15.0    12.0


