如何计算每个用户的最大和最小日期之间的差异
How to calculate difference between max and min date for each user
如何计算每个用户的最大和最小日期之间的差异
我试过 -
df['diff'] = (df['purschase_date'].max() - df['purschase_date'].min()).dt.days
但是它计算了所有行,而不是这个特定的用户,我如何通过 user_id
来计算它
这是数据框的样子
试试这个:
import pandas as pd
d = {'user_id': [2432730,2432730, 2432731,2432731],
'purchase_date': ["2020-09-09", "2020-08-09","2020-09-09","2020-09-19"]}
df = pd.DataFrame(data=d)
df['purchase_date']=pd.to_datetime(df['purchase_date'])
输入:
max_date=df[['user_id','purchase_date']].groupby(by='user_id').max().reset_index().rename(columns={'purchase_date':'max_date'})
min_date=df[['user_id','purchase_date']].groupby(by='user_id').min().reset_index().rename(columns={'purchase_date':'min_date'})
min_date=min_date.join(max_date['max_date'])
min_date['diff']=(min_date['max_date']-min_date['min_date']).dt.days
min_date
输出:
你只需要一个groupbyuser_id来计算差异
df.groupby('user_id')['purchase_date'].max() - df.groupby('user_id')['purchase_date'].min()
这将创建一个系列,它不能直接分配给 df
,因为只有两个用户,所以您将有两行。所以你需要将结果分配回数据框。
为日期的最小值和最大值创建新的数据框,按 ID 分组,命名列,
稍后与原始合并。
数据输入:
import numpy as np
import pandas as pd
df = pd.DataFrame({
"user_id": (np.random.randint(10000,10004,15, dtype="int32")),
"purchase_date": (pd.date_range(start='2022-01-01', periods=15, freq='8H')),
"C": pd.Series(1, index=list(range(15)), dtype="float32"),
"D": np.array([5] * 15, dtype="int32"),
"E": "foo",
})
df['purchase_date'] = pd.to_datetime(df['purchase_date']).dt.normalize()
# Solution
df_grouped = df.groupby(['user_id']).agg(
date_min=('purchase_date', 'min'),
date_max=('purchase_date', 'max'))\
.reset_index()
df_grouped['diff']=(df_grouped['date_max']-df_grouped['date_min']).dt.days
df1 = pd.merge(df, df_grouped)
df1
输出:
user_id purchase_date C D E date_min date_max diff
0 10001 2022-01-01 1.0 5 foo 2022-01-01 2022-01-04 3
1 10001 2022-01-02 1.0 5 foo 2022-01-01 2022-01-04 3
2 10001 2022-01-03 1.0 5 foo 2022-01-01 2022-01-04 3
3 10001 2022-01-04 1.0 5 foo 2022-01-01 2022-01-04 3
4 10000 2022-01-01 1.0 5 foo 2022-01-01 2022-01-04 3
5 10000 2022-01-02 1.0 5 foo 2022-01-01 2022-01-04 3
6 10000 2022-01-03 1.0 5 foo 2022-01-01 2022-01-04 3
7 10000 2022-01-04 1.0 5 foo 2022-01-01 2022-01-04 3
8 10002 2022-01-01 1.0 5 foo 2022-01-01 2022-01-05 4
9 10002 2022-01-02 1.0 5 foo 2022-01-01 2022-01-05 4
10 10002 2022-01-03 1.0 5 foo 2022-01-01 2022-01-05 4
11 10002 2022-01-05 1.0 5 foo 2022-01-01 2022-01-05 4
12 10002 2022-01-05 1.0 5 foo 2022-01-01 2022-01-05 4
13 10003 2022-01-04 1.0 5 foo 2022-01-04 2022-01-05 1
14 10003 2022-01-05 1.0 5 foo 2022-01-04 2022-01-05 1
如何计算每个用户的最大和最小日期之间的差异
我试过 -
df['diff'] = (df['purschase_date'].max() - df['purschase_date'].min()).dt.days
但是它计算了所有行,而不是这个特定的用户,我如何通过 user_id
来计算它这是数据框的样子
试试这个:
import pandas as pd
d = {'user_id': [2432730,2432730, 2432731,2432731],
'purchase_date': ["2020-09-09", "2020-08-09","2020-09-09","2020-09-19"]}
df = pd.DataFrame(data=d)
df['purchase_date']=pd.to_datetime(df['purchase_date'])
输入:
max_date=df[['user_id','purchase_date']].groupby(by='user_id').max().reset_index().rename(columns={'purchase_date':'max_date'})
min_date=df[['user_id','purchase_date']].groupby(by='user_id').min().reset_index().rename(columns={'purchase_date':'min_date'})
min_date=min_date.join(max_date['max_date'])
min_date['diff']=(min_date['max_date']-min_date['min_date']).dt.days
min_date
输出:
你只需要一个groupbyuser_id来计算差异
df.groupby('user_id')['purchase_date'].max() - df.groupby('user_id')['purchase_date'].min()
这将创建一个系列,它不能直接分配给 df
,因为只有两个用户,所以您将有两行。所以你需要将结果分配回数据框。
为日期的最小值和最大值创建新的数据框,按 ID 分组,命名列, 稍后与原始合并。
数据输入:
import numpy as np
import pandas as pd
df = pd.DataFrame({
"user_id": (np.random.randint(10000,10004,15, dtype="int32")),
"purchase_date": (pd.date_range(start='2022-01-01', periods=15, freq='8H')),
"C": pd.Series(1, index=list(range(15)), dtype="float32"),
"D": np.array([5] * 15, dtype="int32"),
"E": "foo",
})
df['purchase_date'] = pd.to_datetime(df['purchase_date']).dt.normalize()
# Solution
df_grouped = df.groupby(['user_id']).agg(
date_min=('purchase_date', 'min'),
date_max=('purchase_date', 'max'))\
.reset_index()
df_grouped['diff']=(df_grouped['date_max']-df_grouped['date_min']).dt.days
df1 = pd.merge(df, df_grouped)
df1
输出:
user_id purchase_date C D E date_min date_max diff
0 10001 2022-01-01 1.0 5 foo 2022-01-01 2022-01-04 3
1 10001 2022-01-02 1.0 5 foo 2022-01-01 2022-01-04 3
2 10001 2022-01-03 1.0 5 foo 2022-01-01 2022-01-04 3
3 10001 2022-01-04 1.0 5 foo 2022-01-01 2022-01-04 3
4 10000 2022-01-01 1.0 5 foo 2022-01-01 2022-01-04 3
5 10000 2022-01-02 1.0 5 foo 2022-01-01 2022-01-04 3
6 10000 2022-01-03 1.0 5 foo 2022-01-01 2022-01-04 3
7 10000 2022-01-04 1.0 5 foo 2022-01-01 2022-01-04 3
8 10002 2022-01-01 1.0 5 foo 2022-01-01 2022-01-05 4
9 10002 2022-01-02 1.0 5 foo 2022-01-01 2022-01-05 4
10 10002 2022-01-03 1.0 5 foo 2022-01-01 2022-01-05 4
11 10002 2022-01-05 1.0 5 foo 2022-01-01 2022-01-05 4
12 10002 2022-01-05 1.0 5 foo 2022-01-01 2022-01-05 4
13 10003 2022-01-04 1.0 5 foo 2022-01-04 2022-01-05 1
14 10003 2022-01-05 1.0 5 foo 2022-01-04 2022-01-05 1