Calculate distinct count in a rolling 3-day window in pandas?
I want to count the unique customers within a rolling 3-day window, grouped by city.
Input:
df = pd.DataFrame([['1A', 'Cairo', '2020-12-01'],
["2A", 'Cairo', '2020-12-01'],
['1A', 'Cairo', '2020-12-02'],
['1A', 'Cairo', '2020-12-03'],
['3A', 'Alex', '2020-12-01'],
['3A', 'Alex', '2020-12-02'],
['3A', 'Alex', '2020-12-03'],
['4A', 'Giza', '2020-12-02'],
['4A', 'Giza', '2020-12-02'],
['5A', 'Giza', '2020-12-03'],
['6A', 'Giza', '2020-12-01']], columns=
['customer_id', 'city', 'day'])
Expected output:
output = pd.DataFrame([['Alex', '2020-12-01',1],
['Alex', '2020-12-02',1],
['Alex', '2020-12-03',1],
['Cairo', '2020-12-01',2],
['Cairo', '2020-12-02',2],
['Cairo', '2020-12-03',2],
['Giza', '2020-12-01',1],
['Giza', '2020-12-02',2],
['Giza', '2020-12-03',3]], columns=
['city', 'day', 'unique_customers_last3Days'])
I tried:
df['day'] = pd.to_datetime(df['day'])
df.set_index('day',inplace=True)
df.sort_index(inplace=True)
df.groupby('city').rolling("3D").agg({'customer_id':'nunique'})
But it gives me this error:
AttributeError: 'nunique' is not a valid function for 'RollingGroupby' object
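Set the index of the dataframe to day, then sort the index values. Next, factorize the customer_id column to assign a unique numeric code to each customer id, since rolling operations only work on numeric data. Then group the dataframe on city and apply the rolling nunique operation with a window size of 3 days. Optionally, drop the duplicate values per city and day, keeping the last row for each pair.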
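# Assumes df is the original input frame, with 'day' still a regular column
df['day'] = pd.to_datetime(df['day'])  # a '3D' window needs a DatetimeIndex
df = df.set_index('day').sort_index()
# Encode each customer_id as an integer so rolling() can process it
df['codes'] = df['customer_id'].factorize()[0]
df.groupby('city')\
  .rolling('3D')['codes'].apply(pd.Series.nunique)\
  .reset_index(name='unique').drop_duplicates(['city', 'day'], keep='last')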
     city        day  unique
0    Alex 2020-12-01     1.0
1    Alex 2020-12-02     1.0
2    Alex 2020-12-03     1.0
4   Cairo 2020-12-01     2.0
5   Cairo 2020-12-02     2.0
6   Cairo 2020-12-03     2.0
7    Giza 2020-12-01     1.0
9    Giza 2020-12-02     2.0
10   Giza 2020-12-03     3.0
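For convenience, here is a self-contained sketch of the same approach, run end to end on the sample input. The helper name rolling_unique_customers and its window parameter are made up for illustration and are not part of pandas. The integer cast and the column name are only there to match the shape of the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame(
    [['1A', 'Cairo', '2020-12-01'], ['2A', 'Cairo', '2020-12-01'],
     ['1A', 'Cairo', '2020-12-02'], ['1A', 'Cairo', '2020-12-03'],
     ['3A', 'Alex', '2020-12-01'], ['3A', 'Alex', '2020-12-02'],
     ['3A', 'Alex', '2020-12-03'], ['4A', 'Giza', '2020-12-02'],
     ['4A', 'Giza', '2020-12-02'], ['5A', 'Giza', '2020-12-03'],
     ['6A', 'Giza', '2020-12-01']],
    columns=['customer_id', 'city', 'day'])


def rolling_unique_customers(frame, window='3D'):
    """Hypothetical helper: distinct customers per city in a trailing window."""
    work = frame.copy()
    # A time-based window such as '3D' needs a sorted DatetimeIndex
    work['day'] = pd.to_datetime(work['day'])
    work = work.set_index('day').sort_index()
    # rolling() only handles numeric data, so map each customer_id to an int code
    work['codes'] = work['customer_id'].factorize()[0]
    counts = (work.groupby('city')
                  .rolling(window)['codes']
                  .apply(pd.Series.nunique)
                  .reset_index(name='unique_customers_last3Days'))
    # Several rows can share a (city, day); the last one has seen the whole day
    counts = counts.drop_duplicates(['city', 'day'], keep='last')
    # apply() returns floats; cast back to int (time windows yield no NaN here)
    counts['unique_customers_last3Days'] = (
        counts['unique_customers_last3Days'].astype(int))
    return counts.reset_index(drop=True)


result = rolling_unique_customers(df)
print(result)
```

Note that the day column in result holds datetime values rather than the strings used in the expected output. If strings are needed, something like result['day'].dt.strftime('%Y-%m-%d') should convert them.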