Fill in missing data using Pandas
What is the best way to fill in missing data using Pandas? I have a list of visitors in which either the exit time or the entry time is missing.
visitor entry exit
A 16/02/2016 08:46 16/02/2016 09:01
A 16/02/2016 09:20 16/02/2016 17:24
A 17/02/2016 09:12 17/02/2016 09:42
A 17/02/2016 09:55 NaT
A 17/02/2016 12:42 17/02/2016 12:56
A 17/02/2016 13:02 17/02/2016 17:32
A 17/02/2016 17:44 17/02/2016 18:24
A 18/02/2016 07:59 18/02/2016 16:40
A 18/02/2016 16:53 NaT
A NaT 19/02/2016 09:11
A 19/02/2016 09:27 19/02/2016 11:26
A 19/02/2016 12:28 19/02/2016 17:12
A 20/02/2016 08:44 20/02/2016 08:58
A 20/02/2016 09:16 20/02/2016 17:21
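For anyone who wants to reproduce this, here is one possible way to build the frame above. It is only a sketch: the question does not show how the data was loaded, so the `rows` list and the day-first format string are assumptions read off the printed table.

```python
import pandas as pd

# Rows copied from the table in the question; None marks a missing timestamp.
rows = [
    ("A", "16/02/2016 08:46", "16/02/2016 09:01"),
    ("A", "16/02/2016 09:20", "16/02/2016 17:24"),
    ("A", "17/02/2016 09:12", "17/02/2016 09:42"),
    ("A", "17/02/2016 09:55", None),
    ("A", "17/02/2016 12:42", "17/02/2016 12:56"),
    ("A", "17/02/2016 13:02", "17/02/2016 17:32"),
    ("A", "17/02/2016 17:44", "17/02/2016 18:24"),
    ("A", "18/02/2016 07:59", "18/02/2016 16:40"),
    ("A", "18/02/2016 16:53", None),
    ("A", None, "19/02/2016 09:11"),
    ("A", "19/02/2016 09:27", "19/02/2016 11:26"),
    ("A", "19/02/2016 12:28", "19/02/2016 17:12"),
    ("A", "20/02/2016 08:44", "20/02/2016 08:58"),
    ("A", "20/02/2016 09:16", "20/02/2016 17:21"),
]
df = pd.DataFrame(rows, columns=["visitor", "entry", "exit"])

# Assumed format: day/month/year hour:minute. None is parsed as NaT.
for col in ("entry", "exit"):
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y %H:%M")

print(df.dtypes)
```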
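You can use DataFrame.ffill + DataFrame.bfill along axis=1 to fill entry and exit in one step. Each missing timestamp is copied from the other column of the same row: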
df[['entry','exit']]=df[['entry','exit']].ffill(axis=1).bfill(axis=1)
print(df)
visitor entry exit
0 A 2016-02-16 08:46:00 2016-02-16 09:01:00
1 A 2016-02-16 09:20:00 2016-02-16 17:24:00
2 A 2016-02-17 09:12:00 2016-02-17 09:42:00
3 A 2016-02-17 09:55:00 2016-02-17 09:55:00
4 A 2016-02-17 12:42:00 2016-02-17 12:56:00
5 A 2016-02-17 13:02:00 2016-02-17 17:32:00
6 A 2016-02-17 17:44:00 2016-02-17 18:24:00
7 A 2016-02-18 07:59:00 2016-02-18 16:40:00
8 A 2016-02-18 16:53:00 2016-02-18 16:53:00
9 A 2016-02-19 09:11:00 2016-02-19 09:11:00
10 A 2016-02-19 09:27:00 2016-02-19 11:26:00
11 A 2016-02-19 12:28:00 2016-02-19 17:12:00
12 A 2016-02-20 08:44:00 2016-02-20 08:58:00
13 A 2016-02-20 09:16:00 2016-02-20 17:21:00
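Note that rows filled this way end up with identical entry and exit times (rows 3, 8 and 9 above). A small sketch of how such zero-length visits could be flagged afterwards. The three-row `sample` frame is hypothetical, and the fill is written with two `fillna` calls, which for two columns should give the same result as the row-wise ffill/bfill:

```python
import pandas as pd

# Hypothetical three-row sample, not the full visitor list
sample = pd.DataFrame({
    "entry": pd.to_datetime(["2016-02-17 09:12", "2016-02-17 09:55", None]),
    "exit": pd.to_datetime(["2016-02-17 09:42", None, "2016-02-19 09:11"]),
})

# Remember which rows had a gap before filling
was_missing = sample[["entry", "exit"]].isna().any(axis=1)

# Copy the other column's timestamp into each gap
filled = sample.copy()
filled["exit"] = filled["exit"].fillna(sample["entry"])
filled["entry"] = filled["entry"].fillna(sample["exit"])

# Rows whose visit now has zero duration
zero_length = (filled["exit"] - filled["entry"]) == pd.Timedelta(0)
print(filled[zero_length])
```

If zero-length visits are not acceptable for your analysis, an estimate based on typical visit length may be a better fit.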
Edit
Use DataFrame.notna + DataFrame.all for boolean indexing that filters out the rows containing NaT values, then compute the mean of the entry-to-exit diff:
#filtering valid data
df_valid=df[df.notna().all(axis=1)]
#Calculating diff
time_dif=df_valid[['entry','exit']].diff(axis=1).exit
print(time_dif)
0 00:15:00
1 08:04:00
2 00:30:00
4 00:14:00
5 04:30:00
6 00:40:00
7 08:41:00
10 01:59:00
11 04:44:00
12 00:14:00
13 08:05:00
Name: exit, dtype: timedelta64[ns]
#Calculating mean
time_dif_mean=time_dif.mean()
print('This is the mean of time in: ', time_dif_mean)
This is the mean of time in: 0 days 03:26:54.545454
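Fill the missing values using the mean
#round to seconds (optional)
time_dif_mean_round_second=time_dif_mean.round('s')
#assign the result back: fillna(..., inplace=True) on a selected column is a chained in-place call and may not update df in newer pandas versions
df['entry']=df['entry'].fillna(df['exit']-time_dif_mean_round_second)
df['exit']=df['exit'].fillna(df['entry']+time_dif_mean_round_second)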
print(df)
Output:
visitor entry exit
0 A 2016-02-16 08:46:00 2016-02-16 09:01:00
1 A 2016-02-16 09:20:00 2016-02-16 17:24:00
2 A 2016-02-17 09:12:00 2016-02-17 09:42:00
3 A 2016-02-17 09:55:00 2016-02-17 13:21:55
4 A 2016-02-17 12:42:00 2016-02-17 12:56:00
5 A 2016-02-17 13:02:00 2016-02-17 17:32:00
6 A 2016-02-17 17:44:00 2016-02-17 18:24:00
7 A 2016-02-18 07:59:00 2016-02-18 16:40:00
8 A 2016-02-18 16:53:00 2016-02-18 20:19:55
9 A 2016-02-19 05:44:05 2016-02-19 09:11:00
10 A 2016-02-19 09:27:00 2016-02-19 11:26:00
11 A 2016-02-19 12:28:00 2016-02-19 17:12:00
12 A 2016-02-20 08:44:00 2016-02-20 08:58:00
13 A 2016-02-20 09:16:00 2016-02-20 17:21:00
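The mean-based steps can also be wrapped into one function that leaves the input untouched. This is only a sketch of the same idea: the name `fill_with_mean_duration`, its parameters, and the four-row `demo` frame are made up for illustration.

```python
import pandas as pd

def fill_with_mean_duration(df, entry="entry", exit_="exit"):
    """Return a copy with missing entry/exit estimated from the mean visit length."""
    out = df.copy()
    # Duration is NaT wherever either timestamp is missing
    duration = out[exit_] - out[entry]
    # mean() skips NaT; rounding to seconds is optional
    mean_duration = duration.mean().round("s")
    # Missing entry = exit - mean; missing exit = entry + mean
    out[entry] = out[entry].fillna(out[exit_] - mean_duration)
    out[exit_] = out[exit_].fillna(out[entry] + mean_duration)
    return out

# Hypothetical data: complete visits last 1 h and 2 h, so the mean is 1 h 30 min
demo = pd.DataFrame({
    "visitor": ["A", "A", "A", "A"],
    "entry": pd.to_datetime(
        ["2016-02-16 08:00", "2016-02-16 10:00", None, "2016-02-16 15:00"]),
    "exit": pd.to_datetime(
        ["2016-02-16 09:00", "2016-02-16 12:00", "2016-02-16 14:00", None]),
})

filled = fill_with_mean_duration(demo)
print(filled)
```

Rows where both timestamps are missing stay NaT, since there is nothing to anchor the estimate on. If the list has several visitors, it may make more sense to compute the mean per visitor rather than over the whole frame.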