Count value between datetime and NaT
I have two pandas DataFrames; in simplified form they look like this:
DF1
+---------+---------+------+-------+
| Date_in | Date_out| Group| Item |
+---------+---------+------+-------+
| 1991-08 | 2000-08 | A | A1 |
| 1992-08 | NaT | A | A2 |
| 1997-02 | NaT | B | B1 |
| 1998-03 | 2001-03 | C | C1 |
| 1999-02 | 2002-02 | D | D1 |
| 2000-02 | NaT | D | D2 |
| 2000-03 | 2001-04 | D | D3 |
| 2001-08 | NaT | D | D4 |
+---------+---------+------+-------+
DF2
+---------+-------+
| Date | Group |
+---------+-------+
| 2000-01 | A |
| 2001-02 | A |
| 2001-03 | B |
| 2001-04 | B |
| 2001-05 | C |
| 2001-06 | C |
| 2001-03 | D |
| 2001-07 | D |
+---------+-------+
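For each row of DF2, I want to count how many items of that row's Group are still present on that Date, based on the Date_in / Date_out ranges in DF1 (NaT in Date_out means the item has not left yet).
Desired output: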
+---------+-------+-------+
| Date | Group | Total |
+---------+-------+-------+
| 2000-01 | A | 2 |
| 2001-02 | A | 1 |
| 2001-03 | B | 1 |
| 2001-04 | B | 1 |
| 2001-05 | C | 0 |
| 2001-06 | C | 0 |
| 2001-03 | D | 3 |
| 2001-07 | D | 2 |
+---------+-------+-------+
As a first step, you can convert all the date columns to datetimes and replace the missing values (NaT) in Date_out with today's date:
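import pandas as pd
# Parse the year-month strings into datetimes (first day of each month)
df2['Date'] = pd.to_datetime(df2['Date'])
df1['Date_in'] = pd.to_datetime(df1['Date_in'])
# A missing Date_out means "still present": fill it with today's date at midnight
df1['Date_out'] = pd.to_datetime(df1['Date_out']).fillna(pd.to_datetime('now').normalize())
print(df1)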
     Date_in   Date_out Group Item
0 1991-08-01 2000-08-01     A   A1
1 1992-08-01 2021-02-12     A   A2
2 1997-02-01 2021-02-12     B   B1
3 1998-03-01 2001-03-01     C   C1
4 1999-02-01 2002-02-01     D   D1
5 2000-02-01 2021-02-12     D   D2
6 2000-03-01 2001-04-01     D   D3
7 2001-08-01 2021-02-12     D   D4
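To reproduce this step, the sample frames from the question can be built by hand as in the sketch below. The explicit `format='%Y-%m'` argument is an addition of mine (not in the answer) so that the year-month strings parse the same way regardless of pandas version; the rest mirrors the conversion above. The filled-in dates will show whatever day the code is run on rather than 2021-02-12.

```python
import pandas as pd

# Sample data copied from the question; None stands in for NaT
df1 = pd.DataFrame({
    'Date_in':  ['1991-08', '1992-08', '1997-02', '1998-03',
                 '1999-02', '2000-02', '2000-03', '2001-08'],
    'Date_out': ['2000-08', None, None, '2001-03',
                 '2002-02', None, '2001-04', None],
    'Group':    list('AABCDDDD'),
    'Item':     ['A1', 'A2', 'B1', 'C1', 'D1', 'D2', 'D3', 'D4'],
})
df2 = pd.DataFrame({
    'Date':  ['2000-01', '2001-02', '2001-03', '2001-04',
              '2001-05', '2001-06', '2001-03', '2001-07'],
    'Group': list('AABBCCDD'),
})

# Explicit format is an assumption added here for predictable parsing
df2['Date'] = pd.to_datetime(df2['Date'], format='%Y-%m')
df1['Date_in'] = pd.to_datetime(df1['Date_in'], format='%Y-%m')

# Open-ended rows (NaT) are treated as lasting until today
today = pd.to_datetime('now').normalize()
df1['Date_out'] = pd.to_datetime(df1['Date_out'], format='%Y-%m').fillna(today)

print(df1)
```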
Then generate every month between Date_in and Date_out for each row, and count the months per group with Grouper and GroupBy.size:
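# One Series per row of df1: the group label repeated for every month start in [Date_in, Date_out]
L = [pd.Series(r.Group, pd.date_range(r.Date_in, r.Date_out, freq='MS'))
     for r in df1.itertuples()]

# Stack them, then count rows per (month, group)
s = (pd.concat(L)
       .reset_index(name='Group')
       .groupby([pd.Grouper(key='index', freq='MS'), 'Group'])
       .size()
       .rename('Total'))
# print(s)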
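One detail worth checking is how the end points are treated: `pd.date_range` includes both the start and the end when they fall on the frequency, so an item is still counted in its Date_out month. The small sketch below shows this for item D3 from the question. The `inclusive='left'` variant is only an illustration of how the end month could be excluded if that were the desired rule; it assumes pandas 1.4 or newer and is not part of the answer.

```python
import pandas as pd

# D3 from the question: in 2000-03, out 2001-04
months = pd.date_range('2000-03-01', '2001-04-01', freq='MS')
print(len(months))             # number of month starts, both ends included
print(months[0], months[-1])   # first and last generated month

# Hypothetical alternative rule: do not count the Date_out month itself
# (the `inclusive` parameter requires pandas >= 1.4)
months_open = pd.date_range('2000-03-01', '2001-04-01',
                            freq='MS', inclusive='left')
print(months_open[-1])
```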
Finally, use DataFrame.join to add the new column, and replace NaN with 0 for the rows that have no match:
df2 = df2.join(s, on=['Date','Group'])
df2['Total'] = df2['Total'].fillna(0).astype(int)
print(df2)
        Date Group  Total
0 2000-01-01     A      2
1 2001-02-01     A      1
2 2001-03-01     B      1
3 2001-04-01     B      1
4 2001-05-01     C      0
5 2001-06-01     C      0
6 2001-03-01     D      3
7 2001-07-01     D      2
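If expanding every row into one entry per month becomes too large, a different technique gives the same counts on this sample: merge DF2 with DF1 on Group and count the rows whose range covers the date. This is an alternative sketch, not the method used in the answer. The helper column `_row` is a hypothetical name introduced here, and NaT in Date_out is handled directly as "not yet left" instead of being filled with today's date.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date_in':  pd.to_datetime(['1991-08', '1992-08', '1997-02', '1998-03',
                                '1999-02', '2000-02', '2000-03', '2001-08'],
                               format='%Y-%m'),
    'Date_out': pd.to_datetime(['2000-08', None, None, '2001-03',
                                '2002-02', None, '2001-04', None],
                               format='%Y-%m'),
    'Group':    list('AABCDDDD'),
    'Item':     ['A1', 'A2', 'B1', 'C1', 'D1', 'D2', 'D3', 'D4'],
})
df2 = pd.DataFrame({
    'Date':  pd.to_datetime(['2000-01', '2001-02', '2001-03', '2001-04',
                             '2001-05', '2001-06', '2001-03', '2001-07'],
                            format='%Y-%m'),
    'Group': list('AABBCCDD'),
})

# Tag each df2 row so the counts can be mapped back after the merge
# (`_row` is a hypothetical helper column)
left = df2.assign(_row=range(len(df2)))
m = left.merge(df1, on='Group', how='left')

# Present = entered on or before Date, and either not left yet (NaT)
# or left on or after Date (end point inclusive, like date_range)
present = (m['Date_in'] <= m['Date']) & (
    m['Date_out'].isna() | (m['Date'] <= m['Date_out'])
)

# Sum the boolean flags per original df2 row; groupby sorts by _row,
# so the result lines up with df2's row order
df2['Total'] = present.groupby(m['_row']).sum().astype(int).to_numpy()
print(df2)
```

The merge creates one row per (DF2 row, DF1 item in the same group), so it suits cases where groups are small but the date ranges are long. Unlike the date_range approach, it compares exact dates and so does not depend on the dates falling on month starts.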
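Edit:
In the real data you need to work with days (dates without the time part) rather than full datetimes, so the solution changes slightly: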
# Remove the time part so the values match the daily keys
df2['date'] = df2['created_at'].dt.normalize()

# Expand each row into one entry per day in [start_date, end_date]
L = [pd.Series(r.dept_name, pd.date_range(r.start_date, r.end_date, freq='D'))
     for r in df1.itertuples()]
s = (pd.concat(L)
       .reset_index(name='dept_name')
       .groupby([pd.Grouper(key='index', freq='D'), 'dept_name'])
       .size()
       .rename('total_member'))

# Join on the date column (without times)
df2 = df2.join(s, on=['date', 'dept_name'])
df2['total_member'] = df2['total_member'].fillna(0).astype(int)
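The real data is not shown, so the sketch below runs the daily variant on a small made-up sample to show why the time part has to be removed. The column names come from the snippet above; the rows themselves are hypothetical. Joining on `created_at` directly would find no matches, because the keys in `s` are all at midnight.

```python
import pandas as pd

# Hypothetical membership periods per department
df1 = pd.DataFrame({
    'dept_name':  ['X', 'X', 'Y'],
    'start_date': pd.to_datetime(['2021-01-01', '2021-01-03', '2021-01-02']),
    'end_date':   pd.to_datetime(['2021-01-05', '2021-01-04', '2021-01-02']),
})
# Hypothetical events with a time of day
df2 = pd.DataFrame({
    'created_at': pd.to_datetime(['2021-01-03 14:30', '2021-01-05 09:00',
                                  '2021-01-03 08:15']),
    'dept_name':  ['X', 'X', 'Y'],
})

# Remove the time part so values match the midnight keys built below
df2['date'] = df2['created_at'].dt.normalize()

# One entry per day of membership, then count per (day, department)
L = [pd.Series(r.dept_name, pd.date_range(r.start_date, r.end_date, freq='D'))
     for r in df1.itertuples()]
s = (pd.concat(L)
       .reset_index(name='dept_name')
       .groupby([pd.Grouper(key='index', freq='D'), 'dept_name'])
       .size()
       .rename('total_member'))

df2 = df2.join(s, on=['date', 'dept_name'])
df2['total_member'] = df2['total_member'].fillna(0).astype(int)
print(df2)
```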