Pandas: Inverting a Cumulative Count
I have a DataFrame containing cumulative count data. The example can be generated as follows (feel free to skip):
import numpy as np
import pandas as pd

cols = ['Start', 'End', 'Count']
# Mixing strings and ints in np.array yields an all-string array,
# so Count is cast back to int once the DataFrame is built.
data = np.array([
    '2020-1-1', '2020-1-2', 4,
    '2020-1-1', '2020-1-3', 6,
    '2020-1-1', '2020-1-4', 8,
    '2020-2-1', '2020-2-2', 3,
    '2020-2-1', '2020-2-3', 4,
    '2020-2-1', '2020-2-4', 4])
data = data.reshape((6, 3))
df = pd.DataFrame(columns=cols, data=data)
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['Count'] = df['Count'].astype(int)
This gives the following DataFrame:
Start End Count
2020-1-1 2020-1-2 4
2020-1-1 2020-1-3 6
2020-1-1 2020-1-4 8
2020-2-1 2020-2-2 3
2020-2-1 2020-2-3 4
2020-2-1 2020-2-4 4
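The counts are cumulative (accumulating from Start), and I want to undo the accumulation to get the following (note the change in the Start dates):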
Start End Count
2020-1-1 2020-1-2 4
2020-1-2 2020-1-3 2
2020-1-3 2020-1-4 2
2020-2-1 2020-2-2 3
2020-2-2 2020-2-3 1
2020-2-3 2020-2-4 0
I want to do this per grouping variable. It can be done naively as follows:
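lst = []
# Assumes df also has a 'grouping_variable' column (not part of the sample above).
for start, data in df.groupby(['Start', 'grouping_variable']):
    data = data.sort_values('End')
    # Difference between consecutive cumulative counts within the group
    diff = data.Count.diff()
    # The first row has no predecessor, so it keeps its original count
    diff.iloc[0] = data.Count.iloc[0]
    # Each interval starts where the previous one ended
    start_dates = [data.Start.iloc[0]] + list(data.End.iloc[:-1].values)
    data = data.assign(Start=start_dates,
                       Count=diff)
    lst.append(data)
df = pd.concat(lst)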
This does not feel "right", "pythonic", or "clean" in any way. Is there a better approach? Perhaps pandas has a dedicated method for this?
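IIUC, we can use cumcount with a boolean comparison to flag the first row of each unique Start group, then use np.where together with shift to keep that first row unchanged and replace every other row with its difference from the previous row.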
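import numpy as np

# Count must be numeric for the subtraction below
df['Count'] = df['Count'].astype(int)
# True on the first row of each Start group (rows assumed sorted by Start, then End)
s = df.groupby(['Start']).cumcount() == 0
# The first row keeps its value; other rows become the difference from the previous row
df['Count'] = np.where(s, df['Count'], df['Count'] - df['Count'].shift())
# Other rows start where the previous row ended
df['Start'] = np.where(s, df['Start'], df['End'].shift(1))
print(df)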
Start End Count
0 2020-01-01 2020-01-02 4.0
1 2020-01-02 2020-01-03 2.0
2 2020-01-03 2020-01-04 2.0
3 2020-02-01 2020-02-02 3.0
4 2020-02-02 2020-02-03 1.0
5 2020-02-03 2020-02-04 0.0
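Since the question asks for this per grouping variable, here is a minimal sketch of a variant that lets groupby do the differencing and shifting, so no value leaks across group boundaries. The column name 'Group' and its values are hypothetical stand-ins for the question's grouping variable, and the rows are assumed to be sortable by End within each group. This is one possible approach, not necessarily the only or best one.

```python
import pandas as pd

# Hypothetical sample: 'Group' stands in for the question's grouping variable.
df = pd.DataFrame({
    'Group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'Start': pd.to_datetime(['2020-01-01'] * 3 + ['2020-02-01'] * 3),
    'End': pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-04',
                           '2020-02-02', '2020-02-03', '2020-02-04']),
    'Count': [4, 6, 8, 3, 4, 4],
})

keys = ['Start', 'Group']
# Sort so that consecutive rows within a group are consecutive in time.
df = df.sort_values(keys + ['End']).reset_index(drop=True)
g = df.groupby(keys)

# Per-group difference; the first row of each group has no predecessor (NaN),
# so it falls back to its original cumulative value.
new_count = g['Count'].diff().fillna(df['Count']).astype(int)

# Each interval starts where the previous interval of the same group ended;
# the first row of each group (NaT after the shift) keeps its original Start.
new_start = g['End'].shift().fillna(df['Start'])

result = df.assign(Start=new_start, Count=new_count)
print(result)
```

Because diff and shift are applied per group, the integer dtype of Count can be restored with astype(int) after the fillna, and a per-group cumsum of the result should reproduce the original cumulative counts.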