Calculate time difference, if difference greater than an hour, mark as 'missing', plot gap in line graph in that area

Question

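I have a basic pandas dataframe in Python that takes in data and plots a line graph. Each data point has an associated time. If everything is running well with the data file, each timestamp is ideally about 30 minutes apart from the next. In some cases, no data comes through for more than an hour. For those periods, I want to mark the timeframe as 'missing' and plot a discontinuous line graph that plainly shows where data is missing.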

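Because the problem is so specific, I'm having a hard time figuring out how to do this, or even how to search for a solution. The data is 'live' and constantly updated, so I can't just pinpoint a certain area and edit it as a workaround.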

Something that looks like this:

Example

Code used to create the datetime column:

# combine the separate time columns into one datetime column
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute', 'second']])

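I have already figured out how to calculate the time difference, which involves creating a new column. Here is that code, just in case: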

# diff() leaves the first row as NaT instead of measuring it against a placeholder date
df['timediff'] = df['datetime'].diff()

A basic look at the dataframe:

datetime               l1    l2    l3
2019-02-03 01:52:16   0.1   0.2   0.4
2019-02-03 02:29:26   0.1   0.3   0.6
2019-02-03 02:48:03   0.1   0.3   0.6
2019-02-03 04:48:52   0.3   0.8   1.4
2019-02-03 05:25:59   0.4   1.1   1.7
2019-02-03 05:44:34   0.4   1.3   2.2

I'm just not sure how to create a discontinuous 'live' plot based on the time difference.

Thanks in advance.

Answer 1

Not exactly what you asked for, but a quick and elegant solution is to resample the data.

df = df.set_index('datetime')
df

                      l1   l2   l3
datetime                          
2019-02-03 01:52:16  0.1  0.2  0.4
2019-02-03 02:29:26  0.1  0.3  0.6
2019-02-03 02:48:03  0.1  0.3  0.6
2019-02-03 04:48:52  0.3  0.8  1.4
2019-02-03 05:25:59  0.4  1.1  1.7
2019-02-03 05:44:34  0.4  1.3  2.2

# '30min' replaces the '30T' alias, which newer pandas versions deprecate
df['l1'].resample('30min').mean().plot(marker='*')
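
To see why resampling produces a visible break, here is a minimal sketch that rebuilds the `l1` column from the question's sample rows and lists the 30-minute bins that received no data. The variable names `resampled` and `empty_bins` are only illustrative.

```python
import pandas as pd

# Sample rows copied from the question (l1 column only)
df = pd.DataFrame(
    {"l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4]},
    index=pd.to_datetime([
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]),
)

# Bins without any sample have a mean of NaN
resampled = df["l1"].resample("30min").mean()

# These are the bins where the plotted line is interrupted
empty_bins = resampled[resampled.isna()]
print(empty_bins.index.tolist())
```

For this sample, the empty bins are the ones starting at 03:00, 03:30 and 04:00. The trade-off is that the points are drawn at bin starts rather than at the original timestamps.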

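If you absolutely need to plot every sample exactly, you can split the data wherever the difference between consecutive timestamps exceeds some threshold, and plot each chunk separately.

from datetime import timedelta

import matplotlib.pyplot as plt
import numpy as np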

# get difference between consecutive timestamps
dt = df.index.to_series()
td = dt - dt.shift()

# generate a new group index every time the time difference exceeds
# an hour
gp = np.cumsum(td > timedelta(hours=1))

# get current axes, plot all groups on the same axes
ax = plt.gca()
for _, chunk in df.groupby(gp):
    chunk['l1'].plot(marker='*', ax=ax)
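
The grouping step is the part that is easiest to get wrong, so here is a small sketch of just that step on the question's timestamps, without any plotting. It uses `diff()` and `pd.Timedelta`, which should be equivalent to the `shift()` and `timedelta` calls above; `times` and `group_id` are illustrative names.

```python
import pandas as pd

# Timestamps copied from the question
times = pd.Series(pd.to_datetime([
    "2019-02-03 01:52:16",
    "2019-02-03 02:29:26",
    "2019-02-03 02:48:03",
    "2019-02-03 04:48:52",
    "2019-02-03 05:25:59",
    "2019-02-03 05:44:34",
]))

# Time since the previous sample (NaT for the first row)
td = times.diff()

# The comparison is True only on the first sample after a long gap,
# so the running sum increases by one at the start of each new chunk
group_id = (td > pd.Timedelta(hours=1)).cumsum()
print(group_id.tolist())  # [0, 0, 0, 1, 1, 1]
```

Rows that share a `group_id` end up in the same chunk and are drawn as one continuous line.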

Alternatively, you can inject "holes" into your data.

# find samples which occurred more than an hour after the previous
# sample
holes = df.loc[td > timedelta(hours=1)].copy()

# "holes" occur just before these samples
holes.index -= timedelta(microseconds=1)

# set the hole values to NaN, then add the holes to the data in time
# order (DataFrame.append was removed in pandas 2.0, so use pd.concat)
holes.loc[:, :] = np.nan
df = pd.concat([df, holes]).sort_index()

# plot series
df['l1'].plot(marker='*')
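
Since the data is 'live', it may be handy to wrap the hole injection in a function that can be re-run on every refresh. The following is only a sketch under that assumption; `insert_gaps` and its `threshold` parameter are hypothetical names, not part of pandas.

```python
import numpy as np
import pandas as pd


def insert_gaps(df, threshold=pd.Timedelta(hours=1)):
    """Return a copy of df with an all-NaN row inside every gap longer than threshold."""
    df = df.sort_index()
    td = df.index.to_series().diff()
    # Timestamps of samples that arrived more than `threshold` after the previous one
    late = df.index[(td > threshold).to_numpy()]
    if len(late) == 0:
        return df
    # One NaN row just before each late sample
    holes = pd.DataFrame(
        np.nan,
        index=late - pd.Timedelta(microseconds=1),
        columns=df.columns,
    )
    return pd.concat([df, holes]).sort_index()


# Sample rows copied from the question (l1 column only)
df = pd.DataFrame(
    {"l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4]},
    index=pd.to_datetime([
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]),
)

gapped = insert_gaps(df)
print(gapped)
```

Calling `gapped['l1'].plot(marker='*')` on the result should then break the line between 02:48:03 and 04:48:52 while keeping every real sample.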

Answer 2

Edit: @Igor Raush gave a better answer, but I am leaving it anyway as the visualization is a bit different.

See if this helps:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Track the elapsed time in seconds since the first sample
# total_seconds() is used rather than .seconds, because .seconds wraps around after one day
df['timediff'] = (df['datetime'] - df['datetime'].shift(1)).dt.total_seconds().cumsum().fillna(0)
# Create a dataframe with one row per second in the time range (+ 1 so the last sample is included)
all_times_df = pd.DataFrame(np.arange(df['timediff'].min(), df['timediff'].max() + 1), columns=['timediff']).set_index('timediff')
# Join the dataframes and forward-fill, so the values change only where data has been received
live_df = all_times_df.join(df.set_index('timediff')).ffill()
# Plot only your desired columns
live_df[['l1', 'l3']].plot()
plt.show()
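
Note that a plain forward-fill holds the last value across the whole outage, so the gap itself is not visible. A possible variation, sketched here on a one-minute grid instead of a one-second grid to keep it small, is to cap the forward-fill with `limit` so that a value is held for at most an hour. The names `grid` and `held` are illustrative.

```python
import pandas as pd

# Sample l1 values copied from the question
s = pd.Series(
    [0.1, 0.1, 0.1, 0.3, 0.4, 0.4],
    index=pd.to_datetime([
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]),
)

# One slot per minute; slots without a sample are NaN
grid = s.resample("1min").last()

# Hold each value for at most 60 minutes, then leave the rest of the gap empty
held = grid.ffill(limit=60)
print(int(held.isna().sum()))
```

Plotting `held` should give the same step-like picture, but with the line interrupted once a value has been stale for more than an hour.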

Answer 3

Solved it using my new timediff column and the df.loc function.

# diff() leaves the first row as NaT, so the first sample is never flagged as following a gap
df['timediff'] = df['datetime'].diff()

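With this, I was able to get the time difference for each row.

Then, using df.loc, I was able to locate the values in the l1 and l2 columns where timediff is greater than an hour and set them to NaN. The result is that the line is missing from the plot at that point in time, just as I wanted.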

# a single .loc call assigns in place; the chained df['l1'].loc[...] form may write to a copy
gap = df['timediff'] > pd.Timedelta(hours=1)
df.loc[gap, ['l1', 'l2']] = np.nan
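
If the goal is also to label the rows as 'missing', as the question asks, one option is to keep the measurements and add a separate label column, blanking values only in a copy used for plotting. This is a sketch on the question's sample data; the `status` column and the `l1_for_plot` name are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (l1 column only)
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2019-02-03 01:52:16",
        "2019-02-03 02:29:26",
        "2019-02-03 02:48:03",
        "2019-02-03 04:48:52",
        "2019-02-03 05:25:59",
        "2019-02-03 05:44:34",
    ]),
    "l1": [0.1, 0.1, 0.1, 0.3, 0.4, 0.4],
})

df["timediff"] = df["datetime"].diff()

# True for the first sample that arrives after more than an hour of silence
after_gap = df["timediff"] > pd.Timedelta(hours=1)

# Label the rows without touching the measurements
df["status"] = np.where(after_gap, "missing", "ok")

# Plot-ready copy of l1 with the flagged samples replaced by NaN
l1_for_plot = df["l1"].mask(after_gap)
print(df[["datetime", "status"]])
```

This keeps the original `l1` values in the dataframe, so the flagged sample can still be inspected later.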

python

time

plot

linegraph

pandas