How to find the gaps in a Python pandas DataFrame column (date format)?
I have a pandas DataFrame like this:
name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-09-01 00:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00
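I want to find the gaps in the year column of the DataFrame for each name.
For example,
- Name AAA has a 2-hour gap after "2015-11-03 02:00:00".
- Name ZZZ has a 5-hour gap after "2015-11-01 01:00:00".
- Name ZZZ has a 2-hour gap after "2015-11-01 09:00:00".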
I want to generate two CSV files with the following content:
CSV-1:
name,year,Gap
AAA,2015-11-02 22:00:00,0
AAA,2015-11-02 23:00:00,0
AAA,2015-11-03 00:00:00,0
AAA,2015-11-03 01:00:00,0
AAA,2015-11-03 02:00:00,2
AAA,2015-11-03 05:00:00,0
ZZZ,2015-09-01 00:00:00,0
ZZZ,2015-11-01 01:00:00,5
ZZZ,2015-11-01 07:00:00,0
ZZZ,2015-11-01 08:00:00,0
ZZZ,2015-11-01 09:00:00,2
ZZZ,2015-11-01 12:00:00,0
CSV-2:
name,prev_year,next_year,gaps
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 03:00:00
AAA,2015-11-03 02:00:00,2015-11-03 05:00:00,2015-11-03 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 02:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 03:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 04:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 05:00:00
ZZZ,2015-11-01 01:00:00,2015-11-01 07:00:00,2015-11-01 06:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 10:00:00
ZZZ,2015-11-01 09:00:00,2015-11-01 12:00:00,2015-11-01 11:00:00
This is what I tried:
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
mask = df.groupby("name").year.diff() > pd.Timedelta('0 days 01:00:00')
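To get the gaps into your DataFrame, assign the result of the diff to a column instead of only building the mask. To express it in total hours, you can simply divide by one hour: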
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')
df['Gap'] = (df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).fillna(0)
This gives the following DataFrame (the output appears to have been computed without the outlying ZZZ,2015-09-01 00:00:00 row of the sample data):
   name                year  Gap
0   AAA 2015-11-02 22:00:00  0.0
1   AAA 2015-11-02 23:00:00  1.0
2   AAA 2015-11-03 00:00:00  1.0
3   AAA 2015-11-03 01:00:00  1.0
4   AAA 2015-11-03 02:00:00  1.0
5   AAA 2015-11-03 05:00:00  3.0
6   ZZZ 2015-11-01 01:00:00  0.0
7   ZZZ 2015-11-01 07:00:00  6.0
8   ZZZ 2015-11-01 08:00:00  1.0
9   ZZZ 2015-11-01 09:00:00  1.0
10  ZZZ 2015-11-01 12:00:00  3.0
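For anyone who wants to reproduce this step, here is a minimal self-contained sketch. It assumes the 11 rows that appear in the output (that is, the sample data without the 2015-09-01 row), and the `SAMPLE` string is only a stand-in for however the data is actually loaded.

```python
import io

import pandas as pd

# Assumption: the 11 rows shown in the output table, without the 2015-09-01 row.
SAMPLE = """name,year
AAA,2015-11-02 22:00:00
AAA,2015-11-02 23:00:00
AAA,2015-11-03 00:00:00
AAA,2015-11-03 01:00:00
AAA,2015-11-03 02:00:00
AAA,2015-11-03 05:00:00
ZZZ,2015-11-01 01:00:00
ZZZ,2015-11-01 07:00:00
ZZZ,2015-11-01 08:00:00
ZZZ,2015-11-01 09:00:00
ZZZ,2015-11-01 12:00:00
"""

df = pd.read_csv(io.StringIO(SAMPLE), parse_dates=["year"])

# Time since the previous row of the same name. The first row of each
# name has no predecessor, so diff() yields NaT there.
elapsed = df.groupby("name")["year"].diff()

# Dividing a timedelta by one hour gives a plain float number of hours.
df["Gap"] = (elapsed / pd.Timedelta(hours=1)).fillna(0)

# The mask from the question: rows that follow their predecessor by more than one hour.
mask = elapsed > pd.Timedelta(hours=1)
print(df[mask])
```

The mask alone only says where a gap ends; the `Gap` column keeps the size of the jump as well.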
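To place the gap next to the time where it starts, matching the layout you want for CSV-1, we just shift the result up by one row and subtract one before filling the NA values: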
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)
This gives:
   name                year  Gap
0   AAA 2015-11-02 22:00:00  0.0
1   AAA 2015-11-02 23:00:00  0.0
2   AAA 2015-11-03 00:00:00  0.0
3   AAA 2015-11-03 01:00:00  0.0
4   AAA 2015-11-03 02:00:00  2.0
5   AAA 2015-11-03 05:00:00  0.0
6   ZZZ 2015-11-01 01:00:00  5.0
7   ZZZ 2015-11-01 07:00:00  0.0
8   ZZZ 2015-11-01 08:00:00  0.0
9   ZZZ 2015-11-01 09:00:00  2.0
10  ZZZ 2015-11-01 12:00:00  0.0
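The one-liner above shifts the whole column, which works here because the first diff of every name is NaN, so nothing leaks from one name into the previous one. As a possible variant, the sketch below looks up the next timestamp inside each group instead, which makes that boundary handling explicit. The variable names `one_hour` and `next_year` are illustrative, and the data again assumes the 11 rows from the output.

```python
import pandas as pd

# Assumption: the 11 rows from the output, without the 2015-09-01 row.
df = pd.DataFrame({
    "name": ["AAA"] * 6 + ["ZZZ"] * 5,
    "year": pd.to_datetime([
        "2015-11-02 22:00:00", "2015-11-02 23:00:00", "2015-11-03 00:00:00",
        "2015-11-03 01:00:00", "2015-11-03 02:00:00", "2015-11-03 05:00:00",
        "2015-11-01 01:00:00", "2015-11-01 07:00:00", "2015-11-01 08:00:00",
        "2015-11-01 09:00:00", "2015-11-01 12:00:00",
    ]),
})

one_hour = pd.Timedelta(hours=1)

# Next timestamp of the same name; NaT on the last row of each name.
next_year = df.groupby("name")["year"].shift(-1)

# Hours missing between this row and the next one. Consecutive rows give 0,
# and the last row of each name (NaN after the division) is filled with 0.
df["Gap"] = ((next_year - df["year"]) / one_hour - 1).fillna(0).astype(int)
print(df)
```

Casting to `int` is optional; it only makes the CSV show `2` instead of `2.0`, as in the desired CSV-1.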
To get your second CSV, we can do the following:
df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)
df.set_index('year', inplace=True)
# lowercase 'h': the uppercase 'H' alias is deprecated since pandas 2.2
df = df.groupby('name', as_index=False)\
    .resample(rule='1h')\
    .ffill()\
    .reset_index()
gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]
# rename() takes the mapping via columns=; passing index='columns' raises a TypeError
gaps = gaps.rename(columns={'year': 'gaps'})
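First we set up the "before" and "after" columns (prev_year and next_year). Then, by making 'year' the index, we can use the .resample() method to fill in all the missing hours. Using ffill() while resampling copies the last available record into every newly added row. Wherever 'prev_year' != 'year', the row did not exist in the original frame and is therefore one of the gaps, so we keep only those rows, select the columns we need, and rename them. This gives: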
   name           prev_year           next_year                gaps
5   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 03:00:00
6   AAA 2015-11-03 02:00:00 2015-11-03 05:00:00 2015-11-03 04:00:00
9   ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 02:00:00
10  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 03:00:00
11  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 04:00:00
12  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 05:00:00
13  ZZZ 2015-11-01 01:00:00 2015-11-01 07:00:00 2015-11-01 06:00:00
17  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 10:00:00
18  ZZZ 2015-11-01 09:00:00 2015-11-01 12:00:00 2015-11-01 11:00:00
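If the grouped resample behaves differently on your pandas version, the same table can probably be built without it. The sketch below swaps in a different technique: it pairs every timestamp with the next one of the same name and lists the hours in between with pd.date_range. The helper name `list_missing_hours` and its `step` parameter are hypothetical, and the data assumes the 11 rows from the output.

```python
import pandas as pd


def list_missing_hours(df, step="1h"):
    """Hypothetical helper: one output row per missing timestamp, per name."""
    step = pd.Timedelta(step)
    ordered = df.sort_values(["name", "year"])
    pairs = pd.DataFrame({
        "name": ordered["name"],
        "prev_year": ordered["year"],
        # Next timestamp of the same name; NaT on the last row of each name.
        "next_year": ordered.groupby("name")["year"].shift(-1),
    })
    # Keep only pairs that are more than one step apart (NaT compares as False).
    pairs = pairs[pairs["next_year"] - pairs["prev_year"] > step]
    rows = [
        (name, prev, nxt, ts)
        for name, prev, nxt in pairs.itertuples(index=False, name=None)
        # Timestamps strictly between prev and nxt, one per step.
        for ts in pd.date_range(prev + step, nxt - step, freq=step)
    ]
    return pd.DataFrame(rows, columns=["name", "prev_year", "next_year", "gaps"])


# Assumption: the 11 rows from the output, without the 2015-09-01 row.
df = pd.DataFrame({
    "name": ["AAA"] * 6 + ["ZZZ"] * 5,
    "year": pd.to_datetime([
        "2015-11-02 22:00:00", "2015-11-02 23:00:00", "2015-11-03 00:00:00",
        "2015-11-03 01:00:00", "2015-11-03 02:00:00", "2015-11-03 05:00:00",
        "2015-11-01 01:00:00", "2015-11-01 07:00:00", "2015-11-01 08:00:00",
        "2015-11-01 09:00:00", "2015-11-01 12:00:00",
    ]),
})

gaps = list_missing_hours(df)
print(gaps)
```

The result has a fresh 0-based index rather than the positions from the resampled frame, which makes no difference once it is written with `to_csv(..., index=False)`.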
Putting it all together, your script might look like this:
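import pandas as pd

# df is the DataFrame from the question, with columns 'name' and 'year'
df['year'] = pd.to_datetime(df['year'], format='%Y-%m-%d %H:%M:%S')

# CSV-1: number of missing hours after each timestamp
df['Gap'] = ((df.groupby("name").year.diff() / pd.to_timedelta('1 hour')).shift(-1) - 1).fillna(0)
df.to_csv('csv-1.csv', index=False)

# CSV-2: one row per missing hour, with the timestamps that surround it
df['prev_year'] = df['year']
df['next_year'] = df.groupby('name')['year'].shift(-1)
df.set_index('year', inplace=True)
# lowercase 'h': the uppercase 'H' alias is deprecated since pandas 2.2
df = df.groupby('name', as_index=False)\
    .resample(rule='1h')\
    .ffill()\
    .reset_index()
gaps = df[df['year'] != df['prev_year']][['name', 'prev_year', 'next_year', 'year']]
# rename() takes the mapping via columns=; passing index='columns' raises a TypeError
gaps = gaps.rename(columns={'year': 'gaps'})
gaps.to_csv('csv-2.csv', index=False)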