Count value between datetime and NaT

I have two Python pandas DataFrames. In simplified form they look like this:

DF1

+---------+---------+------+-------+
| Date_in | Date_out| Group| Item  |
+---------+---------+------+-------+
| 1991-08 | 2000-08 |   A  |   A1  |
| 1992-08 |   NaT   |   A  |   A2  |
| 1997-02 |   NaT   |   B  |   B1  |
| 1998-03 | 2001-03 |   C  |   C1  |
| 1999-02 | 2002-02 |   D  |   D1  |
| 2000-02 |   NaT   |   D  |   D2  |
| 2000-03 | 2001-04 |   D  |   D3  |
| 2001-08 |   NaT   |   D  |   D4  |
+---------+---------+------+-------+

DF2

+---------+-------+
|  Date   | Group | 
+---------+-------+
| 2000-01 |   A   | 
| 2001-02 |   A   | 
| 2001-03 |   B   |
| 2001-04 |   B   | 
| 2001-05 |   C   | 
| 2001-06 |   C   |
| 2001-03 |   D   |
| 2001-07 |   D   |
+---------+-------+

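For each row of DF2, I want to count how many items of that Group still exist on that Date, based on the Date_in / Date_out limits in DF1.

Desired output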

+---------+-------+-------+
|  Date   | Group | Total |
+---------+-------+-------+
| 2000-01 |   A   |   2   |
| 2001-02 |   A   |   1   |
| 2001-03 |   B   |   1   |
| 2001-04 |   B   |   1   |
| 2001-05 |   C   |   0   |
| 2001-06 |   C   |   0   |
| 2001-03 |   D   |   3   |
| 2001-07 |   D   |   2   |
+---------+-------+-------+

As a first step, you can convert all the columns to datetimes and replace the missing NaT values with today's date:

import pandas as pd

df2['Date'] = pd.to_datetime(df2['Date'])

df1['Date_in'] = pd.to_datetime(df1['Date_in'])
# NaT in Date_out means the item is still present, so treat it as ending today
df1['Date_out'] = pd.to_datetime(df1['Date_out']).fillna(pd.Timestamp.now().normalize())
print (df1)
     Date_in   Date_out Group Item
0 1991-08-01 2000-08-01     A   A1
1 1992-08-01 2021-02-12     A   A2
2 1997-02-01 2021-02-12     B   B1
3 1998-03-01 2001-03-01     C   C1
4 1999-02-01 2002-02-01     D   D1
5 2000-02-01 2021-02-12     D   D2
6 2000-03-01 2001-04-01     D   D3
7 2001-08-01 2021-02-12     D   D4
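
For anyone who wants to reproduce this step, here is a minimal sketch that rebuilds the two sample frames from the question's tables and applies the same conversion. The explicit `format='%Y-%m'` and the `today` variable are my own additions for clarity, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question's tables; None becomes NaT after conversion.
df1 = pd.DataFrame({
    'Date_in':  ['1991-08', '1992-08', '1997-02', '1998-03',
                 '1999-02', '2000-02', '2000-03', '2001-08'],
    'Date_out': ['2000-08', None, None, '2001-03',
                 '2002-02', None, '2001-04', None],
    'Group': list('AABCDDDD'),
    'Item':  ['A1', 'A2', 'B1', 'C1', 'D1', 'D2', 'D3', 'D4'],
})
df2 = pd.DataFrame({
    'Date':  ['2000-01', '2001-02', '2001-03', '2001-04',
              '2001-05', '2001-06', '2001-03', '2001-07'],
    'Group': list('AABBCCDD'),
})

# Today's date at midnight, used as the end date for items that are still open.
today = pd.Timestamp.now().normalize()

df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m')
df1['Date_in'] = pd.to_datetime(df1['Date_in'], format='%Y-%m')
df1['Date_out'] = pd.to_datetime(df1['Date_out'], format='%Y-%m').fillna(today)
```

Because the fill value is the current date, the filled Date_out values will differ from the 2021-02-12 shown in the output above.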

Then get all months between Date_in and Date_out, and count them with Grouper and GroupBy.size:

# count the months per group
L = [pd.Series(r.Group,pd.date_range(r.Date_in, r.Date_out, freq='MS')) 
     for r in df1.itertuples()]
s = (pd.concat(L)
         .reset_index(name='Group')
         .groupby([pd.Grouper(key='index', freq='MS'), 'Group'])
         .size()
         .rename('Total'))

# print (s)
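
To see what this expansion does, here is a small sketch on a hypothetical two-row frame (the `demo` frame and the group name `'X'` are made up for illustration). Each row becomes a Series with one entry per month start in its interval, so counting entries per month and group gives the number of items present in that month.

```python
import pandas as pd

# Hypothetical data: two items in the same group with overlapping intervals.
demo = pd.DataFrame({
    'Date_in':  pd.to_datetime(['2000-01-01', '2000-02-01']),
    'Date_out': pd.to_datetime(['2000-03-01', '2000-02-01']),
    'Group': ['X', 'X'],
})

# One Series per row: the group label repeated for every month start in the interval.
L = [pd.Series(r.Group, index=pd.date_range(r.Date_in, r.Date_out, freq='MS'))
     for r in demo.itertuples()]

# Stack the Series, then count entries per (month, group).
counts = (pd.concat(L)
            .reset_index(name='Group')
            .groupby([pd.Grouper(key='index', freq='MS'), 'Group'])
            .size()
            .rename('Total'))

print(counts)
```

Both endpoints of `pd.date_range` are inclusive, so an item still counts in the month of its Date_out.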

Finally, use DataFrame.join to add the new column, and replace NaN with 0 for the rows that have no match:

df2 = df2.join(s, on=['Date','Group'])
df2['Total'] = df2['Total'].fillna(0).astype(int)
print (df2)
        Date Group  Total
0 2000-01-01     A      2
1 2001-02-01     A      1
2 2001-03-01     B      1
3 2001-04-01     B      1
4 2001-05-01     C      0
5 2001-06-01     C      0
6 2001-03-01     D      3
7 2001-07-01     D      2
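
If expanding every interval into a date range is too heavy for your data, a different technique (not the one used above) is to merge on Group and test the interval bounds directly. The following is only a sketch under that assumption; the helper name `count_active` and the temporary `row` column are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date_in':  ['1991-08', '1992-08', '1997-02', '1998-03',
                 '1999-02', '2000-02', '2000-03', '2001-08'],
    'Date_out': ['2000-08', None, None, '2001-03',
                 '2002-02', None, '2001-04', None],
    'Group': list('AABCDDDD'),
    'Item':  ['A1', 'A2', 'B1', 'C1', 'D1', 'D2', 'D3', 'D4'],
})
df2 = pd.DataFrame({
    'Date':  ['2000-01', '2001-02', '2001-03', '2001-04',
              '2001-05', '2001-06', '2001-03', '2001-07'],
    'Group': list('AABBCCDD'),
})
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m')
df1['Date_in'] = pd.to_datetime(df1['Date_in'], format='%Y-%m')
df1['Date_out'] = (pd.to_datetime(df1['Date_out'], format='%Y-%m')
                     .fillna(pd.Timestamp.now().normalize()))


def count_active(df1, df2):
    """Count df1 items whose [Date_in, Date_out] interval covers each df2 date."""
    # Keep the original df2 row label so counts can be mapped back after the merge.
    left = df2.reset_index().rename(columns={'index': 'row'})
    merged = left.merge(df1, on='Group', how='left')
    # Inclusive on both ends, matching the inclusive endpoints of pd.date_range.
    active = merged['Date'].between(merged['Date_in'], merged['Date_out'])
    totals = active.groupby(merged['row']).sum()
    return df2.assign(Total=totals.reindex(df2.index, fill_value=0).astype(int))


result = count_active(df1, df2)
print(result)
```

The trade-off is that the merge produces one row per (df2 row, df1 item) pair within a group, so it suits data with long intervals and moderately sized groups.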

EDIT:

The real data needs to be matched by day rather than by month, so the solution is slightly modified:

#remove times
df2['date'] = df2['created_at'].dt.normalize()

#convert date_range by days
L = [pd.Series(r.dept_name,pd.date_range(r.start_date, r.end_date, freq='D')) 
     for r in df1.itertuples()]
s = (pd.concat(L)
    .reset_index(name='dept_name')
    .groupby([pd.Grouper(key='index', freq='D'), 'dept_name'])
    .size()
    .rename('total_member'))

#join by column date (without times)
df2 = df2.join(s, on=['date','dept_name'])
df2['total_member'] = df2['total_member'].fillna(0).astype(int)
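
Here is a self-contained sketch of the day-level version. The column names follow the snippet above, but the department names, dates, and timestamps are hypothetical values I made up to have something to run.

```python
import pandas as pd

# Hypothetical membership intervals, one row per member.
df1 = pd.DataFrame({
    'dept_name':  ['ops', 'ops', 'dev'],
    'start_date': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-02']),
    'end_date':   pd.to_datetime(['2021-01-04', '2021-01-05', '2021-01-02']),
})
# Hypothetical events that carry a time-of-day component.
df2 = pd.DataFrame({
    'created_at': pd.to_datetime(['2021-01-03 14:30', '2021-01-05 09:00',
                                  '2021-01-04 23:59']),
    'dept_name':  ['ops', 'ops', 'dev'],
})

# Drop the time part so events can be matched against whole days.
df2['date'] = df2['created_at'].dt.normalize()

# One Series per member: the department repeated for every day in the interval.
L = [pd.Series(r.dept_name, index=pd.date_range(r.start_date, r.end_date, freq='D'))
     for r in df1.itertuples()]
s = (pd.concat(L)
       .reset_index(name='dept_name')
       .groupby([pd.Grouper(key='index', freq='D'), 'dept_name'])
       .size()
       .rename('total_member'))

# Join on (date, dept_name); rows without a match get 0.
df2 = df2.join(s, on=['date', 'dept_name'])
df2['total_member'] = df2['total_member'].fillna(0).astype(int)
print(df2)
```

Note that a daily expansion creates one entry per member per day, so long intervals can use a lot of memory.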