How to subtract two columns with a one-row lag and group by multiple columns in Python
I have a dataset with two ID columns and two date columns, as shown below:
import numpy as np
import pandas as pd
mydata = {'ID1': [1,1,2,3,3,4],
'ID2': [1,2,3,4,5,6],
'Date1': ['2011-04-23','2011-05-13','2012-04-23','2012-05-13','2011-08-23','2011-08-26'],
'Date2': ['2011-04-25','2011-05-23','2012-04-1','2011-05-18','2011-08-24','2011-08-29']
}
mydata = pd.DataFrame(mydata)
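I want to create a new column, say days, as follows: if ID1 is unique, the value is -1; if ID1 is not unique, the value is the difference between Date1 (lagged by one row) and Date2. The code below works to some extent, but it does not produce -1 for unique ID1 values. It is also rather awkward. Any help with an alternative solution is appreciated.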
mydata['Date1'] = pd.to_datetime(mydata['Date1'])
mydata['Date2'] = pd.to_datetime(mydata['Date2'])
mydata = mydata.sort_values(['ID1', 'Date1'], ascending=[True, True])
diff_time = mydata['Date2'].rsub(mydata['Date1'].shift(-1), axis=0)
mydata['days'] = np.where(mydata['ID1']==mydata['ID1'].shift(-1),
(diff_time.dt.days*24+diff_time.astype(str).str.split('[ :]').str[2].astype(float))/24,0)
Output:
ID1 ID2 Date1 Date2 days
0 1 1 2011-04-23 2011-04-25 18.0
1 1 2 2011-05-13 2011-05-23 0.0
2 2 3 2012-04-23 2012-04-01 0.0 # 0.0 here should be -1 as ID1 is unique
4 3 5 2011-08-23 2011-08-24 263.0
3 3 4 2012-05-13 2011-05-18 0.0
5 4 6 2011-08-26 2011-08-29 0.0 # 0.0 here should be -1 as ID1 is unique
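You can use DataFrameGroupBy.shift to take the next Date1 within each ID1 group, flag the rows whose ID1 is duplicated with Series.duplicated(keep=False), and set every other row to -1 in numpy.where: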
mydata['Date1'] = pd.to_datetime(mydata['Date1'])
mydata['Date2'] = pd.to_datetime(mydata['Date2'])
mydata = mydata.sort_values(['ID1', 'Date1'], ascending=[True, True])
mask = mydata['ID1'].duplicated(keep=False)
diff_time = mydata['Date2'].rsub(mydata.groupby('ID1')['Date1'].shift(-1))
mydata['days'] = np.where(mask, diff_time.dt.days, -1)
print(mydata)
ID1 ID2 Date1 Date2 days
0 1 1 2011-04-23 2011-04-25 18.0
1 1 2 2011-05-13 2011-05-23 NaN
2 2 3 2012-04-23 2012-04-01 -1.0
4 3 5 2011-08-23 2011-08-24 263.0
3 3 4 2012-05-13 2011-05-18 NaN
5 4 6 2011-08-26 2011-08-29 -1.0
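For convenience, the same steps can be put into one self-contained script. This is only a sketch of the approach above, not a different method: the helper name lagged_days is made up for illustration, and the single-digit day '2012-04-1' from the question is written zero-padded as '2012-04-01' so the date strings share one format.

```python
import numpy as np
import pandas as pd


def lagged_days(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a sorted copy of df with a 'days' column."""
    out = df.copy()
    out["Date1"] = pd.to_datetime(out["Date1"])
    out["Date2"] = pd.to_datetime(out["Date2"])
    out = out.sort_values(["ID1", "Date1"])

    # Next Date1 within the same ID1 group; NaT on the last row of each group.
    next_date1 = out.groupby("ID1")["Date1"].shift(-1)
    diff_days = (next_date1 - out["Date2"]).dt.days

    # True for every row whose ID1 occurs more than once.
    repeated = out["ID1"].duplicated(keep=False)

    # Repeated ID1: lagged difference in days; unique ID1: -1.
    out["days"] = np.where(repeated, diff_days, -1)
    return out


mydata = pd.DataFrame({
    "ID1": [1, 1, 2, 3, 3, 4],
    "ID2": [1, 2, 3, 4, 5, 6],
    "Date1": ["2011-04-23", "2011-05-13", "2012-04-23",
              "2012-05-13", "2011-08-23", "2011-08-26"],
    # '2012-04-1' in the question, zero-padded here (assumption: same date).
    "Date2": ["2011-04-25", "2011-05-23", "2012-04-01",
              "2011-05-18", "2011-08-24", "2011-08-29"],
})

result = lagged_days(mydata)
print(result)
```

The last row of each repeated ID1 group stays NaN because there is no later Date1 to subtract from. If another value is wanted there, something like result['days'].fillna(0) could be applied afterwards; which value to use is a choice the question leaves open.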