内部加入 pandas 个具有日期范围列的数据框
Inner join pandas dataframes with date range columns
我有 2 个带日期范围的数据框
>>>>df1 = pd.DataFrame({'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
'from_dt':[pd.to_datetime('2004-01-01 00:00:00'),
pd.to_datetime('2005-01-01 00:00:00'),
pd.to_datetime('2007-01-01 00:00:00'),
pd.to_datetime('2011-01-01 00:00:00'),
pd.to_datetime('2004-01-01 00:00:00'),
pd.to_datetime('2012-01-01 00:00:00'),
pd.to_datetime('2019-01-01 00:00:00'),
],
'thru_dt':[pd.to_datetime('2004-12-31 23:59:59'),
pd.to_datetime('2006-12-31 23:59:59'),
pd.to_datetime('2010-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
pd.to_datetime('2011-12-31 23:59:59'),
pd.to_datetime('2018-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
],
'val':[1,2,3,4,5,6,7]})
>>>>df1
ID from_dt thru_dt val
0 a 2004-01-01 2004-12-31 23:59:59 1
1 a 2005-01-01 2006-12-31 23:59:59 2
2 a 2007-01-01 2010-12-31 23:59:59 3
3 a 2011-01-01 2075-12-31 23:59:59 4
4 b 2004-01-01 2011-12-31 23:59:59 5
5 b 2012-01-01 2018-12-31 23:59:59 6
6 b 2019-01-01 2075-12-31 23:59:59 7
第一个包含 ID 到值的映射
>>>>df2 = pd.DataFrame({'ID':['a', 'a', 'b'], 'ID2':['A1', 'A2', 'B1'],
'from_dt':[pd.to_datetime('2003-01-01 00:00:00'),
pd.to_datetime('2010-01-01 00:00:00'),
pd.to_datetime('2005-01-01 00:00:00'),
],
'thru_dt':[pd.to_datetime('2009-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
]
})
>>>>df2
ID ID2 from_dt thru_dt
0 a A1 2003-01-01 2009-12-31 23:59:59
1 a A2 2010-01-01 2075-12-31 23:59:59
2 b B1 2005-01-01 2075-12-31 23:59:59
第二个数据帧将一个 ID 映射到另一个。
我想用一些内部连接逻辑加入它们,这样我就有了一个具有以下输出的平面文件数据框
>>>>df3 = some_function(df1, df2, end_date=pd.to_datetime('2020-12-31'))
>>>>df3
ID2 Date val
0 A1 2004-01-01 1
1 A1 2004-01-02 1
.
.
x A1 2004-12-31 1
x A1 2005-01-01 2
x A1 2005-01-02 2
.
.
x A1 2009-12-31 3
x A2 2010-01-01 3
.
.
x A2 2020-12-31 4
x B1 2005-01-01 5
.
.
x B1 2020-12-31 7
我确定我可以循环并以某种低效的方式执行此操作。我很想了解可以处理此类任务的任何现有工具和库。
谢谢!
我认为您可以使用列表理解来展平日期范围,然后按结束日期时间进行过滤:
s1 = pd.concat([pd.Series(r.Index,pd.date_range(r.from_dt, r.thru_dt))
for r in df1.itertuples()])
df11 = df1.drop(['from_dt','thru_dt'], 1).join(pd.DataFrame({'Date':s1.index}, index=s1))
df11 = df11[df11['Date'].lt('2020-12-31')]
print (df11)
s2 = pd.concat([pd.Series(r.Index,pd.date_range(r.from_dt, r.thru_dt))
for r in df2.itertuples()])
df22 = df2.drop(['from_dt','thru_dt'], 1).join(pd.DataFrame({'Date':s2.index}, index=s2))
df22 = df22[df22['Date'].lt('2020-12-31')]
print (df22)
我有 2 个带日期范围的数据框
>>>>df1 = pd.DataFrame({'ID': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
'from_dt':[pd.to_datetime('2004-01-01 00:00:00'),
pd.to_datetime('2005-01-01 00:00:00'),
pd.to_datetime('2007-01-01 00:00:00'),
pd.to_datetime('2011-01-01 00:00:00'),
pd.to_datetime('2004-01-01 00:00:00'),
pd.to_datetime('2012-01-01 00:00:00'),
pd.to_datetime('2019-01-01 00:00:00'),
],
'thru_dt':[pd.to_datetime('2004-12-31 23:59:59'),
pd.to_datetime('2006-12-31 23:59:59'),
pd.to_datetime('2010-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
pd.to_datetime('2011-12-31 23:59:59'),
pd.to_datetime('2018-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
],
'val':[1,2,3,4,5,6,7]})
>>>>df1
ID from_dt thru_dt val
0 a 2004-01-01 2004-12-31 23:59:59 1
1 a 2005-01-01 2006-12-31 23:59:59 2
2 a 2007-01-01 2010-12-31 23:59:59 3
3 a 2011-01-01 2075-12-31 23:59:59 4
4 b 2004-01-01 2011-12-31 23:59:59 5
5 b 2012-01-01 2018-12-31 23:59:59 6
6 b 2019-01-01 2075-12-31 23:59:59 7
第一个包含 ID 到值的映射
>>>>df2 = pd.DataFrame({'ID':['a', 'a', 'b'], 'ID2':['A1', 'A2', 'B1'],
'from_dt':[pd.to_datetime('2003-01-01 00:00:00'),
pd.to_datetime('2010-01-01 00:00:00'),
pd.to_datetime('2005-01-01 00:00:00'),
],
'thru_dt':[pd.to_datetime('2009-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
pd.to_datetime('2075-12-31 23:59:59'),
]
})
>>>>df2
ID ID2 from_dt thru_dt
0 a A1 2003-01-01 2009-12-31 23:59:59
1 a A2 2010-01-01 2075-12-31 23:59:59
2 b B1 2005-01-01 2075-12-31 23:59:59
第二个数据帧将一个 ID 映射到另一个。
我想用一些内部连接逻辑加入它们,这样我就有了一个具有以下输出的平面文件数据框
>>>>df3 = some_function(df1, df2, end_date=pd.to_datetime('2020-12-31'))
>>>>df3
ID2 Date val
0 A1 2004-01-01 1
1 A1 2004-01-02 1
.
.
x A1 2004-12-31 1
x A1 2005-01-01 2
x A1 2005-01-02 2
.
.
x A1 2009-12-31 3
x A2 2010-01-01 3
.
.
x A2 2020-12-31 4
x B1 2005-01-01 5
.
.
x B1 2020-12-31 7
我确定我可以循环并以某种低效的方式执行此操作。我很想了解可以处理此类任务的任何现有工具和库。
谢谢!
我认为您可以使用列表理解来展平日期范围,然后按结束日期时间进行过滤:
s1 = pd.concat([pd.Series(r.Index,pd.date_range(r.from_dt, r.thru_dt))
for r in df1.itertuples()])
df11 = df1.drop(['from_dt','thru_dt'], 1).join(pd.DataFrame({'Date':s1.index}, index=s1))
df11 = df11[df11['Date'].lt('2020-12-31')]
print (df11)
s2 = pd.concat([pd.Series(r.Index,pd.date_range(r.from_dt, r.thru_dt))
for r in df2.itertuples()])
df22 = df2.drop(['from_dt','thru_dt'], 1).join(pd.DataFrame({'Date':s2.index}, index=s2))
df22 = df22[df22['Date'].lt('2020-12-31')]
print (df22)