Pandas: take the value nearest to each second and interpolate
I am looking to transform a dataframe of the following format, as an example:
>>>df
vals
2019-08-10 12:03:05 1.0
2019-08-10 12:03:06 NaN
2019-08-10 12:03:07 NaN
2019-08-10 12:03:08 3.0
2019-08-10 12:03:09 4.0
2019-08-10 12:03:10 NaN
2019-08-10 12:03:11 NaN
2019-08-10 12:03:12 5.0
2019-08-10 12:03:13 NaN
2019-08-10 12:03:14 1.0
2019-08-10 12:03:15 NaN
2019-08-10 12:03:16 NaN
2019-08-10 12:03:17 6.0
into one like:
>>>df
vals
2019-08-10 12:03:05 1.0
2019-08-10 12:03:06 1.667
2019-08-10 12:03:07 2.333
2019-08-10 12:03:08 3.0
2019-08-10 12:03:09 3.667
2019-08-10 12:03:10 4.333
2019-08-10 12:03:11 5.0
2019-08-10 12:03:12 3.667
2019-08-10 12:03:13 2.333
2019-08-10 12:03:14 1.0
2019-08-10 12:03:15 2.667
2019-08-10 12:03:16 4.333
2019-08-10 12:03:17 6.0
by first aligning the dataframe as shown below (taking the value nearest to every 3rd second):
>>>df
vals
2019-08-10 12:03:05 1.0
2019-08-10 12:03:06 NaN
2019-08-10 12:03:07 NaN
2019-08-10 12:03:08 3.0
2019-08-10 12:03:09 NaN
2019-08-10 12:03:10 NaN
2019-08-10 12:03:11 5.0
2019-08-10 12:03:12 NaN
2019-08-10 12:03:13 NaN
2019-08-10 12:03:14 1.0
2019-08-10 12:03:15 NaN
2019-08-10 12:03:16 NaN
2019-08-10 12:03:17 6.0
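and then linearly interpolating between each value to produce the final dataframe. If there is a gap of more than 2 seconds, I don't want to interpolate between those two values.
This is what I have tried so far: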
df.resample('3s').nearest()
which yields:
>>> df.resample('3s').nearest()
vals
2019-08-10 12:03:03 1.0
2019-08-10 12:03:06 NaN
2019-08-10 12:03:09 4.0
2019-08-10 12:03:12 5.0
2019-08-10 12:03:15 NaN
And also:
>>> df.resample('2s').nearest()
vals
2019-08-10 12:03:04 1.0
2019-08-10 12:03:06 NaN
2019-08-10 12:03:08 3.0
2019-08-10 12:03:10 NaN
2019-08-10 12:03:12 5.0
2019-08-10 12:03:14 1.0
2019-08-10 12:03:16 NaN
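This makes it clear that nearest is a complete lie, or at least a misnomer, since the value nearest to 10 is clearly 4. Moreover, the final value at 2019-08-10 12:03:16 should surely be 6.0.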
This is only an attempt to align the values to the second; after that, a simple interpolate seems to work.
Any help is appreciated.
If you want to replace the NaN values with the nearest value, you can use interpolate:
data['value'] = data['value'].interpolate(method='nearest')
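Note that in pandas `interpolate(method='nearest')` relies on SciPy being installed, and it fills every NaN rather than aligning values to a 3-second grid. As a rough sketch of the alignment step the question describes, one option is `Series.reindex` with `method="nearest"` and a `tolerance`. The names `target` and `aligned`, the rebuilt example frame, and the 1-second tolerance are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (1-second index, column "vals").
idx = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan,
        5.0, np.nan, 1.0, np.nan, np.nan, 6.0]
df = pd.DataFrame({"vals": vals}, index=idx)

# Assumed target grid: every 3rd second, starting at the first timestamp.
target = idx[::3]

# For each target timestamp take the closest non-NaN observation,
# but only if it lies within 1 second (assumed tolerance); otherwise NaN.
aligned = df["vals"].dropna().reindex(
    target, method="nearest", tolerance=pd.Timedelta("1s")
)

# Put the aligned values back on the full 1-second index.
df["aligned"] = aligned.reindex(df.index)
print(df)
```

Here the value 5.0 recorded at 12:03:12 is moved to 12:03:11 because it is the closest observation within the tolerance, while 4.0 at 12:03:09 is dropped because 12:03:08 already has an exact match.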
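I think you need the base parameter to change the offset of the sampling periods, set to the first index value modulo 3 (because of the 3 seconds), together with Resampler.first (note that base was deprecated in pandas 1.1 and removed in 2.0, where offset takes its place):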
df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)
vals new
2019-08-10 12:03:05 1.0 1.0
2019-08-10 12:03:06 NaN NaN
2019-08-10 12:03:07 NaN NaN
2019-08-10 12:03:08 3.0 3.0
2019-08-10 12:03:09 4.0 NaN
2019-08-10 12:03:10 NaN NaN
2019-08-10 12:03:11 NaN 5.0
2019-08-10 12:03:12 5.0 NaN
2019-08-10 12:03:13 NaN NaN
2019-08-10 12:03:14 1.0 1.0
2019-08-10 12:03:15 NaN NaN
2019-08-10 12:03:16 NaN NaN
2019-08-10 12:03:17 6.0 6.0
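For newer pandas versions the same step could be sketched with the `offset` argument, which shifts the bin edges by a `Timedelta`. This is a minimal sketch under the assumption that the index is a regular 1-second `DatetimeIndex`; the example frame is rebuilt here so the block runs on its own, and `df["vals"]` is selected explicitly so that a Series is assigned to the new column.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
idx = pd.date_range("2019-08-10 12:03:05", periods=13, freq="s")
vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan,
        5.0, np.nan, 1.0, np.nan, np.nan, 6.0]
df = pd.DataFrame({"vals": vals}, index=idx)

# Shift the 3-second bins so that one of them starts on the first timestamp.
offset = pd.Timedelta(seconds=df.index[0].second % 3)

# first() returns the first non-NaN value of each bin, labelled by the
# bin's left edge; assigning back aligns on the index and leaves NaN elsewhere.
df["new"] = df["vals"].resample("3s", offset=offset).first()
print(df)
```

The bins are [12:03:05, 12:03:08), [12:03:08, 12:03:11) and so on, so each bin's first valid value is placed at the bin start.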
Then interpolate:
df['new'] = df['new'].interpolate()
print (df)
vals new
2019-08-10 12:03:05 1.0 1.000000
2019-08-10 12:03:06 NaN 1.666667
2019-08-10 12:03:07 NaN 2.333333
2019-08-10 12:03:08 3.0 3.000000
2019-08-10 12:03:09 4.0 3.666667
2019-08-10 12:03:10 NaN 4.333333
2019-08-10 12:03:11 NaN 5.000000
2019-08-10 12:03:12 5.0 3.666667
2019-08-10 12:03:13 NaN 2.333333
2019-08-10 12:03:14 1.0 1.000000
2019-08-10 12:03:15 NaN 2.666667
2019-08-10 12:03:16 NaN 4.333333
2019-08-10 12:03:17 6.0 6.000000
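The question also asks that gaps longer than 2 seconds not be interpolated, which a plain `interpolate()` does not enforce. One possible sketch is below; it reads "gap" as a run of consecutive NaN rows on a 1-second index, which is an assumption, and the helper name `interpolate_short_gaps` and the sample series are hypothetical.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(s, max_gap=2):
    """Linearly interpolate, leaving runs of more than `max_gap` NaNs untouched."""
    isna = s.isna()
    # Give every run of consecutive NaN (or non-NaN) rows its own id.
    run_id = isna.astype(int).diff().ne(0).cumsum()
    # Length of the NaN run each row belongs to (0 for non-NaN rows).
    run_len = isna.astype(int).groupby(run_id).transform("sum")
    # Interpolate only between valid values, never beyond the ends.
    filled = s.interpolate(limit_area="inside")
    # Keep original values, plus interpolated values inside short gaps only.
    return filled.where(~isna | (run_len <= max_gap))

# Hypothetical series: one 2-row gap (filled) and one 4-row gap (left as NaN).
idx = pd.date_range("2019-08-10 12:03:05", periods=10, freq="s")
s = pd.Series([1.0, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 5.0, 6.0],
              index=idx)
out = interpolate_short_gaps(s)
print(out)
```

Using `interpolate(limit=2)` alone would not be equivalent, since it still fills the first two rows of a longer gap.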
Test by adding 2 seconds to the index:
df.index += pd.Timedelta(2, 's')
df['new'] = df.resample('3s', base=df.index[0].second % 3).first()
print (df)
vals new
2019-08-10 12:03:07 1.0 1.0
2019-08-10 12:03:08 NaN NaN
2019-08-10 12:03:09 NaN NaN
2019-08-10 12:03:10 3.0 3.0
2019-08-10 12:03:11 4.0 NaN
2019-08-10 12:03:12 NaN NaN
2019-08-10 12:03:13 NaN 5.0
2019-08-10 12:03:14 5.0 NaN
2019-08-10 12:03:15 NaN NaN
2019-08-10 12:03:16 1.0 1.0
2019-08-10 12:03:17 NaN NaN
2019-08-10 12:03:18 NaN NaN
2019-08-10 12:03:19 6.0 6.0
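The shifted-index test shows that the bins have to follow the first timestamp. Since pandas 1.1, `resample` also accepts `origin="start"`, which anchors the bins at the first value of the series without any modulo arithmetic. The loop below is only an illustrative sketch; the `results` dictionary and the two start times are assumptions chosen to mirror the test above.

```python
import numpy as np
import pandas as pd

vals = [1.0, np.nan, np.nan, 3.0, 4.0, np.nan, np.nan,
        5.0, np.nan, 1.0, np.nan, np.nan, 6.0]

results = {}
for start in ["2019-08-10 12:03:05", "2019-08-10 12:03:07"]:
    idx = pd.date_range(start, periods=len(vals), freq="s")
    s = pd.Series(vals, index=idx, name="vals")
    # origin="start" makes the first 3-second bin begin at the first timestamp.
    results[start] = s.resample("3s", origin="start").first()
    print(results[start])
```

Both start times give the same five aligned values, with the bin labels moved by 2 seconds in the second case.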
# assumes the timestamps are stored in a column named 'Time'
df1 = df.set_index(['Time']).interpolate(method='linear').reset_index()
print(df1)
Output:
Time vals
0 2019-08-10 12:03:05 1.000000
1 2019-08-10 12:03:06 1.666667
2 2019-08-10 12:03:07 2.333333
3 2019-08-10 12:03:08 3.000000
4 2019-08-10 12:03:09 4.000000
5 2019-08-10 12:03:10 4.333333
6 2019-08-10 12:03:11 4.666667
7 2019-08-10 12:03:12 5.000000
8 2019-08-10 12:03:13 3.000000
9 2019-08-10 12:03:14 1.000000
10 2019-08-10 12:03:15 2.666667
11 2019-08-10 12:03:16 4.333333
12 2019-08-10 12:03:17 6.000000