pandas groupby: compare a string value with the previous row's value and record changes in new columns
I have this demo df:
import pandas as pd

info = {'customer': ['Jason', 'Jason', 'Jason', 'Jason',
                     'Molly', 'Molly', 'Molly', 'Molly'],
        'Good': ['Cookie', 'Cookie', 'Cookie', 'Cookie',
                 'Ice Cream', 'Ice Cream', 'Ice Cream', 'Ice Cream'],
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(data=info)
df
which gives:
  customer       Good        Date      Flavor
0    Jason     Cookie  2021-12-14   Chocolate
1    Jason     Cookie  2022-01-04     Vanilla
2    Jason     Cookie  2022-01-11     Vanilla
3    Jason     Cookie  2022-01-18  Strawberry
4    Molly  Ice Cream  2022-01-12   Chocolate
5    Molly  Ice Cream  2022-01-15     Vanilla
6    Molly  Ice Cream  2022-01-19     Caramel
7    Molly  Ice Cream  2022-01-30     Caramel
I am trying to track the flavor changes per customer, per good, in new columns From and To. I have done the grouping part:
df.sort_values(['Date']).groupby(['customer','Good','Date'])['Flavor'].sum()
and I got:
customer  Good       Date
Jason     Cookie     2021-12-14     Chocolate
                     2022-01-04       Vanilla
                     2022-01-11       Vanilla
                     2022-01-18    Strawberry
Molly     Ice Cream  2022-01-12     Chocolate
                     2022-01-15       Vanilla
                     2022-01-19       Caramel
                     2022-01-30       Caramel
Name: Flavor, dtype: object
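The first row of each group is the entry point. From there I want to compare each row with the previous one in the same group: if the flavor differs, record the change in the new columns (From and To); if it is the same, leave them empty.
I have tried several approaches, but unfortunately I could not work out the best way to do this.
Expected output, after reset_index():
  customer       Good        Date      Flavor       From          To
0    Jason     Cookie  2021-12-14   Chocolate
1    Jason     Cookie  2022-01-04     Vanilla  Chocolate     Vanilla
2    Jason     Cookie  2022-01-11     Vanilla
3    Jason     Cookie  2022-01-18  Strawberry    Vanilla  Strawberry
4    Molly  Ice Cream  2022-01-12   Chocolate
5    Molly  Ice Cream  2022-01-15     Vanilla  Chocolate     Vanilla
6    Molly  Ice Cream  2022-01-19     Caramel    Vanilla     Caramel
7    Molly  Ice Cream  2022-01-30     Caramel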
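Building on the sum you created (named g), we can groupby the first two levels of the index and shift it, then join the result back to g. After rename-ing the new column, mask the "To" and "From" columns wherever nothing changed or the previous value is NaN. Finally, select the columns and reset_index to get back a flat DataFrame: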
g = df.sort_values(['Date']).groupby(['customer','Good','Date'])['Flavor'].sum()
# Join each flavor with the previous flavor in the same (customer, Good) group.
prev = g.groupby(level=[0,1]).shift().to_frame()
joined = g.to_frame().assign(To=g).join(prev, lsuffix='', rsuffix='_').rename(columns={'Flavor_':'From'})
# Blank out From/To on each group's first row (NaN) and on rows with no change.
joined.update(joined[['To','From']].mask(joined['From'].isna() | joined['From'].eq(joined['To']), ''))
out = joined[['Flavor','From','To']].reset_index()
Output:
  customer       Good        Date      Flavor       From          To
0    Jason     Cookie  2021-12-14   Chocolate
1    Jason     Cookie  2022-01-04     Vanilla  Chocolate     Vanilla
2    Jason     Cookie  2022-01-11     Vanilla
3    Jason     Cookie  2022-01-18  Strawberry    Vanilla  Strawberry
4    Molly  Ice Cream  2022-01-12   Chocolate
5    Molly  Ice Cream  2022-01-15     Vanilla  Chocolate     Vanilla
6    Molly  Ice Cream  2022-01-19     Caramel    Vanilla     Caramel
7    Molly  Ice Cream  2022-01-30     Caramel
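To see what the per-group shift contributes on its own, here is a minimal sketch that rebuilds the demo data and prints each flavor next to the previous flavor in the same group. It assumes each (customer, Good, Date) combination occurs only once, so it uses first() in place of sum(); the names prev and side_by_side are only illustrative.

```python
import pandas as pd

info = {'customer': ['Jason'] * 4 + ['Molly'] * 4,
        'Good': ['Cookie'] * 4 + ['Ice Cream'] * 4,
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(info)

# Assumption: one row per (customer, Good, Date), so first() returns the
# same values that sum() would.
g = df.sort_values('Date').groupby(['customer', 'Good', 'Date'])['Flavor'].first()

# Shift inside each (customer, Good) group: every row receives the flavor
# of the row before it, and the first row of each group receives NaN.
prev = g.groupby(level=[0, 1]).shift()

side_by_side = g.to_frame().assign(prev=prev)
print(side_by_side)
```

The NaN on the first row of each group is what the isna() condition in the mask step takes care of.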
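# Previous flavor within each (customer, Good) group, in date order;
# groupby(...).shift() keeps the original row index, so it aligns with df.
s = df.assign(
    From=df.sort_values(by='Date').groupby(['customer', 'Good'])['Flavor'].shift(),
    To=df['Flavor'],
).dropna()  # drops each group's first row, which has no previous flavor
# Keep only the rows where the flavor changed, then blank out the rest.
out = df.join(s.loc[s['From'] != s['To'], ['From', 'To']]).fillna('')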
Output:
  customer       Good        Date      Flavor       From          To
0    Jason     Cookie  2021-12-14   Chocolate
1    Jason     Cookie  2022-01-04     Vanilla  Chocolate     Vanilla
2    Jason     Cookie  2022-01-11     Vanilla
3    Jason     Cookie  2022-01-18  Strawberry    Vanilla  Strawberry
4    Molly  Ice Cream  2022-01-12   Chocolate
5    Molly  Ice Cream  2022-01-15     Vanilla  Chocolate     Vanilla
6    Molly  Ice Cream  2022-01-19     Caramel    Vanilla     Caramel
7    Molly  Ice Cream  2022-01-30     Caramel
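A more compact variant, offered as a sketch rather than a definitive solution: work on df directly, take the previous flavor with groupby(...).shift(), and use Series.where to blank out the rows with no change. The names ordered, prev and changed are hypothetical, and the sort assumes Date is an ISO-formatted string (or a datetime) so that it orders chronologically.

```python
import pandas as pd

info = {'customer': ['Jason'] * 4 + ['Molly'] * 4,
        'Good': ['Cookie'] * 4 + ['Ice Cream'] * 4,
        'Date': ['2021-12-14', '2022-01-04', '2022-01-11', '2022-01-18',
                 '2022-01-12', '2022-01-15', '2022-01-19', '2022-01-30'],
        'Flavor': ['Chocolate', 'Vanilla', 'Vanilla', 'Strawberry',
                   'Chocolate', 'Vanilla', 'Caramel', 'Caramel']}
df = pd.DataFrame(info)

# Put each (customer, Good) group in date order.
ordered = df.sort_values(['customer', 'Good', 'Date'])

# Previous flavor within the group; NaN on the first row of each group.
prev = ordered.groupby(['customer', 'Good'])['Flavor'].shift()

# A change needs a previous value that differs from the current one.
changed = prev.notna() & prev.ne(ordered['Flavor'])

# where() keeps values where the condition holds and writes '' elsewhere.
out = ordered.assign(
    From=prev.where(changed, ''),
    To=ordered['Flavor'].where(changed, ''),
)
print(out)
```

Keeping the boolean changed as its own Series makes it easy to reuse, for example out[changed] to list only the rows where a switch happened.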