Rolling up / cumulative sum of units shipped within 3 days of the first shipment in Pandas
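This is a little hard to explain, but I'll do my best, so please bear with me.
I have a pandas DataFrame with ID, shipping date, and units.
I want to total the units shipped within a 3-day window, and the windows must not overlap. For example, my DataFrame looks like this.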
ID             Shipping Date  Units  Expected output
153131151007   20180801       1      1
153131151007   20180828       1      2
153131151007   20180829       1      0
153131151007   20180904       1      1
153131151007   20181226       2      4
153131151007   20181227       1      0
153131151007   20181228       1      0
153131151007   20190110       1      1
153131151007   20190115       2      3
153131151007   20190116       1      0
153131151011*  20180510       1      2
153131151011*  20180511       1      0
153131151011*  20180513       1      2
153131151011*  20180515       1      0
153131151011*  20180813       1      1
153131151011*  20180822       1      2
153131151011*  20180824       1      0
153131151011*  20190103       1      1
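The code should look at each date and check whether there are any shipments within the next 3 days. If there are, it should sum them into the current date's row and make sure those units are not counted again when the next date is processed.
So for the first ID, shipping date 20181226, it checks 1226, 1227, and 1228, adds them together, shows the result at 1226, and shows 0 in the next 2 cells.
Similarly for the 2nd ID: 20180510 is the first shipping date in the series. It checks 0510, 0511, and 0512, puts the sum at 0510, and zeros the rest. That is why 0511 does not take 0513 into account; 0513 belongs to a different shipment group.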
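To make the rule concrete, here is a minimal plain-Python sketch of the windowing for a single ID, as I read it from the description above. The function name `window_totals` and its `days` parameter are made up for illustration, and the input dates are assumed to be sorted in ascending order.

```python
from datetime import date, timedelta

def window_totals(dates, units, days=3):
    """Non-overlapping window sums for one ID (dates sorted ascending).

    The first shipment opens a window covering `days` calendar days,
    itself included. Later shipments inside that window are added to the
    opener's total and reported as 0. The first shipment past the window
    opens the next one.
    """
    totals = [0] * len(dates)
    anchor = None  # position of the shipment that opened the current window
    for i, (d, u) in enumerate(zip(dates, units)):
        if anchor is None or d >= dates[anchor] + timedelta(days=days):
            anchor = i  # this shipment starts a new window
        totals[anchor] += u
    return totals

# Second ID from the question: 0510 absorbs 0511, then 0513 opens a new window.
dates = [date(2018, 5, 10), date(2018, 5, 11), date(2018, 5, 13), date(2018, 5, 15)]
print(window_totals(dates, [1, 1, 1, 1]))  # [2, 0, 2, 0]
```

The windows are anchored on actual shipment dates rather than on fixed calendar buckets, which is why a plain rolling sum alone does not give the expected column.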
data = pd.DataFrame({'ID':['153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*'],
'Date':[20180801,20180828,20180829,20180904,20181226,20181227,20181228,20190110,20190115,20190116,20180510,20180511,20180513,20180515,20180813,20180822,20180824,20190103],
'Units':[1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1]})
This works, but the result is in wide format:
import pandas as pd
import numpy as np
from dateutil.parser import parse
from datetime import timedelta
data = pd.DataFrame({'ID':['153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151007','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*','153131151011*'],
'Date':[20180801,20180828,20180829,20180904,20181226,20181227,20181228,20190110,20190115,20190116,20180510,20180511,20180513,20180515,20180813,20180822,20180824,20190103],
'Units':[1,1,1,1,2,1,1,1,2,1,1,1,1,1,1,1,1,1]})
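def keep_first(ser):
    # Walk forward from the first date; each date found in the index is kept
    # as a window start, and the 3 days it covers are then skipped.
    ixs = []
    ts = ser.dropna().index[0]
    while ts <= ser.dropna().index.max():
        if ts in ser.dropna().index:
            ixs.append(ts)
            ts += timedelta(3)
        else:
            ts += timedelta(1)
    # Keep the rolling sum on window starts, zero everywhere else.
    return np.where(ser.index.isin(ixs), ser, 0)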
data['Date'] = data['Date'].map(lambda x: parse(str(x))) # parse dates
units = data.groupby(['ID', 'Date']).sum().unstack(0).resample('D').sum() # create resampled units df
units = units.sort_index(ascending=False).rolling(3, min_periods=1).sum().sort_index() # calculate forward-rolling sum
grouped_ix = data.groupby(['ID', 'Date']).sum().unstack(0).index # get indices for actual data
units.loc[grouped_ix].apply(keep_first) # get sums for actual data indices, keep only first
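If a long-format result is preferred (one value per original row), one possible alternative is sketched below. It is not a reshaped version of the code above but a different approach: label every row with the date that opened its window, then sum per window. The names `window_start`, `WindowStart`, and `Output` are made up for illustration, and the sketch assumes at most one row per ID and date.

```python
import pandas as pd

data = pd.DataFrame({
    'ID': ['153131151007'] * 10 + ['153131151011*'] * 8,
    'Date': [20180801, 20180828, 20180829, 20180904, 20181226, 20181227,
             20181228, 20190110, 20190115, 20190116, 20180510, 20180511,
             20180513, 20180515, 20180813, 20180822, 20180824, 20190103],
    'Units': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})
data['Date'] = pd.to_datetime(data['Date'].astype(str), format='%Y%m%d')
data = data.sort_values(['ID', 'Date']).reset_index(drop=True)

def window_start(dates, days=3):
    """For each date in a sorted Series, return the date that opened its window."""
    starts = []
    current = None
    for d in dates:
        if current is None or d >= current + pd.Timedelta(days=days):
            current = d  # first shipment past the previous window
        starts.append(current)
    return pd.Series(starts, index=dates.index)

# Illustrative helper column: first shipment date of each 3-day window, per ID.
data['WindowStart'] = data.groupby('ID')['Date'].transform(window_start)
# Total units per (ID, window), broadcast back to every row of that window.
window_sum = data.groupby(['ID', 'WindowStart'])['Units'].transform('sum')
# Keep the total on the opening row only; other rows in the window get 0.
data['Output'] = window_sum.where(data['Date'] == data['WindowStart'], 0)
print(data[['ID', 'Date', 'Units', 'Output']])
```

Because the window labels are computed within each ID group, the dates of one ID cannot influence where another ID's windows start. The Python loop is linear in the number of rows, which should be acceptable unless the frame is very large.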