Display a column when a desired value is missing while grouping in Pandas dataframe
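Good day, I have a dataframe with regions, customers and some deliveries. One column holds the purchase type: the first and last purchases are marked 'FIRST' and 'LAST', and sometimes there are deliveries in between, marked 'DELIVERY'. I need to flag, as a column in the desired output, the customer and region combinations that have no in-between deliveries at all. Flagging the in-between delivery rows themselves is not hard, but I need to flag the whole customer-region group.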
import pandas as pd
data = [['NY', 'A','FIRST', 10], ['NY', 'A','DELIVERY', 20], ['NY', 'A','DELIVERY', 30], ['NY', 'A','LAST', 25],
['NY', 'B','FIRST', 15], ['NY', 'B','DELIVERY', 10], ['NY', 'B','LAST', 20],
['FL', 'A','FIRST', 15], ['FL', 'A','DELIVERY', 10], ['FL', 'A','DELIVERY', 12], ['FL', 'A','DELIVERY', 25], ['FL', 'A','LAST', 20],
['FL', 'C','FIRST', 15], ['FL', 'C','LAST', 10],
['FL', 'D','FIRST', 10], ['FL', 'D','DELIVERY', 20], ['FL', 'D','LAST', 30],
['FL', 'E','FIRST', 20], ['FL', 'E','LAST', 20]
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['region', 'customer', 'purchaseType', 'price'])
# print dataframe.
df
Prints:
region customer purchaseType price
0 NY A FIRST 10
1 NY A DELIVERY 20
2 NY A DELIVERY 30
3 NY A LAST 25
4 NY B FIRST 15
5 NY B DELIVERY 10
6 NY B LAST 20
7 FL A FIRST 15
8 FL A DELIVERY 10
9 FL A DELIVERY 12
10 FL A DELIVERY 25
11 FL A LAST 20
12 FL C FIRST 15
13 FL C LAST 10
14 FL D FIRST 10
15 FL D DELIVERY 20
16 FL D LAST 30
17 FL E FIRST 20
18 FL E LAST 20
Desired output:
region customer purchaseType price noDeliveryFlag
0 NY A FIRST 10 0
1 NY A DELIVERY 20 0
2 NY A DELIVERY 30 0
3 NY A LAST 25 0
4 NY B FIRST 15 0
5 NY B DELIVERY 10 0
6 NY B LAST 20 0
7 FL A FIRST 15 0
8 FL A DELIVERY 10 0
9 FL A DELIVERY 12 0
10 FL A DELIVERY 25 0
11 FL A LAST 20 0
12 FL C FIRST 15 1
13 FL C LAST 10 1
14 FL D FIRST 10 0
15 FL D DELIVERY 20 0
16 FL D LAST 30 0
17 FL E FIRST 20 1
18 FL E LAST 20 1
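Thanks a lot!
First we work out the delivery status per region and customer. To do this we group by region and customer, then within each group check whether 'DELIVERY' appears in that group's part of the purchaseType series. If there is no delivery we assign 1 to the group, otherwise 0 (True/False would arguably be more natural here, but this sticks to the question).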
delivery_status = (df.groupby(['region', 'customer'], sort=False)['purchaseType']
.apply(lambda d: 1*('DELIVERY' not in d.values))
.rename('noDeliveryFlag')
)
delivery_status
This produces
region customer
NY A 0
B 0
FL A 0
C 1
D 0
E 1
Name: noDeliveryFlag, dtype: int64
Then we just join this back onto the original df
(df.set_index(['region', 'customer'])
.join(delivery_status,how = 'left', sort=False)
.reset_index()
)
to get
region customer purchaseType price noDeliveryFlag
-- -------- ---------- -------------- ------- ----------------
0 FL A FIRST 15 0
1 FL A DELIVERY 10 0
2 FL A DELIVERY 12 0
3 FL A DELIVERY 25 0
4 FL A LAST 20 0
5 FL C FIRST 15 1
6 FL C LAST 10 1
7 FL D FIRST 10 0
8 FL D DELIVERY 20 0
9 FL D LAST 30 0
10 FL E FIRST 20 1
11 FL E LAST 20 1
12 NY A FIRST 10 0
13 NY A DELIVERY 20 0
14 NY A DELIVERY 30 0
15 NY A LAST 25 0
16 NY B FIRST 15 0
17 NY B DELIVERY 10 0
18 NY B LAST 20 0
Note that this solution does not check whether a DELIVERY falls between FIRST and LAST; it only checks whether the region/customer has no DELIVERY at all.
I think I figured it out
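If the position of DELIVERY relative to FIRST and LAST does matter, one possible sketch is below. It is not part of the answer above: the helper has_delivery_between and the customer 'X' (whose DELIVERY comes after LAST) are hypothetical, and the code assumes rows are already in chronological order within each group.

```python
import pandas as pd

# Small illustrative frame; customer 'X' is a made-up case with a DELIVERY after LAST.
df = pd.DataFrame(
    [['NY', 'A', 'FIRST'], ['NY', 'A', 'DELIVERY'], ['NY', 'A', 'LAST'],
     ['FL', 'C', 'FIRST'], ['FL', 'C', 'LAST'],
     ['FL', 'X', 'FIRST'], ['FL', 'X', 'LAST'], ['FL', 'X', 'DELIVERY']],
    columns=['region', 'customer', 'purchaseType'])

def has_delivery_between(types):
    """True if a DELIVERY sits after the first FIRST and before the last LAST."""
    vals = list(types)
    if 'FIRST' not in vals or 'LAST' not in vals:
        return False
    start = vals.index('FIRST')
    end = len(vals) - 1 - vals[::-1].index('LAST')
    return 'DELIVERY' in vals[start + 1:end]

# One flag per (region, customer) group: 1 means no in-between delivery.
flags = (df.groupby(['region', 'customer'], sort=False)['purchaseType']
           .apply(lambda s: int(not has_delivery_between(s)))
           .rename('noDeliveryFlag')
           .reset_index())

# A left merge keeps the original row order of df.
result = df.merge(flags, on=['region', 'customer'], how='left')
print(result)
```

Here NY/A gets 0, while FL/C and the hypothetical FL/X both get 1, because the only DELIVERY for X comes after its LAST row.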
df['noDeliveryFlag'] = df['purchaseType'] != 'DELIVERY'
df['noDeliveryFlag'] = df.groupby(['region','customer'])['noDeliveryFlag'].transform('min').astype(int)
print(df)
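If anyone has a more efficient approach, I would appreciate it.
You can use transform and size with a groupby operation.
This method assumes that anyone with only 2 purchaseTypes has had no delivery; it does not account for ongoing deliveries.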
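The 'min' trick above works because False sorts below True, so a single DELIVERY row pulls the group minimum down to False. A sketch of the same idea stated the other way round, using transform('any') on a boolean mask, might look like this (the names is_delivery and has_delivery are only illustrative, and no claim is made that it is faster):

```python
import pandas as pd

data = [['NY', 'A', 'FIRST', 10], ['NY', 'A', 'DELIVERY', 20], ['NY', 'A', 'DELIVERY', 30], ['NY', 'A', 'LAST', 25],
        ['NY', 'B', 'FIRST', 15], ['NY', 'B', 'DELIVERY', 10], ['NY', 'B', 'LAST', 20],
        ['FL', 'A', 'FIRST', 15], ['FL', 'A', 'DELIVERY', 10], ['FL', 'A', 'DELIVERY', 12], ['FL', 'A', 'DELIVERY', 25], ['FL', 'A', 'LAST', 20],
        ['FL', 'C', 'FIRST', 15], ['FL', 'C', 'LAST', 10],
        ['FL', 'D', 'FIRST', 10], ['FL', 'D', 'DELIVERY', 20], ['FL', 'D', 'LAST', 30],
        ['FL', 'E', 'FIRST', 20], ['FL', 'E', 'LAST', 20]]
df = pd.DataFrame(data, columns=['region', 'customer', 'purchaseType', 'price'])

# Boolean mask of the delivery rows.
is_delivery = df['purchaseType'].eq('DELIVERY')

# Broadcast "does this group contain any delivery?" back onto every row.
has_delivery = is_delivery.groupby([df['region'], df['customer']]).transform('any')

# Invert and cast to 0/1 to match the column asked for in the question.
df['noDeliveryFlag'] = (~has_delivery).astype(int)
print(df)
```

Because transform returns a result aligned with the original index, the rows stay in their original order and no join is needed.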
import numpy as np
df['noDeliveryFlag'] = np.where(df.groupby(['customer','region'])
['purchaseType'].transform('size').eq(2),1,0)
region customer purchaseType price noDeliveryFlag
0 NY A FIRST 10 0
1 NY A DELIVERY 20 0
2 NY A DELIVERY 30 0
3 NY A LAST 25 0
4 NY B FIRST 15 0
5 NY B DELIVERY 10 0
6 NY B LAST 20 0
7 FL A FIRST 15 0
8 FL A DELIVERY 10 0
9 FL A DELIVERY 12 0
10 FL A DELIVERY 25 0
11 FL A LAST 20 0
12 FL C FIRST 15 1
13 FL C LAST 10 1
14 FL D FIRST 10 0
15 FL D DELIVERY 20 0
16 FL D LAST 30 0
17 FL E FIRST 20 1
18 FL E DELIVERY 20 1
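To make the caveat about ongoing deliveries concrete, here is a small sketch on hypothetical data (the frame demo and the column names flagBySize and flagByContent are made up for illustration). It compares the size-based flag with a check on the actual purchaseType values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: customer E has an ongoing delivery (FIRST, DELIVERY, no LAST yet),
# customer F has only FIRST and LAST.
demo = pd.DataFrame({
    'region': ['FL', 'FL', 'FL', 'FL'],
    'customer': ['E', 'E', 'F', 'F'],
    'purchaseType': ['FIRST', 'DELIVERY', 'FIRST', 'LAST'],
})

# Size-based flag: every group with exactly 2 rows is treated as "no delivery".
group_size = demo.groupby(['customer', 'region'])['purchaseType'].transform('size')
demo['flagBySize'] = np.where(group_size.eq(2), 1, 0)

# Content-based flag: 1 only if no row in the group is a DELIVERY.
no_delivery = (demo['purchaseType'].ne('DELIVERY')
               .groupby([demo['customer'], demo['region']])
               .transform('all'))
demo['flagByContent'] = no_delivery.astype(int)
print(demo)
```

Customer E is flagged 1 by the size-based version even though it has a DELIVERY row, while the content-based version gives 0 for E and 1 for F.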