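Most efficient way to determine overlapping timeseries in Python
I am trying to determine the percentage of time that two time series overlap, using Python's pandas library. The data is asynchronous, so the timestamps of the data points do not line up. Here is an example:
Time series 1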
2016-10-05 11:50:02.000734 0.50
2016-10-05 11:50:03.000033 0.25
2016-10-05 11:50:10.000479 0.50
2016-10-05 11:50:15.000234 0.25
2016-10-05 11:50:37.000199 0.50
2016-10-05 11:50:49.000401 0.50
2016-10-05 11:50:51.000362 0.25
2016-10-05 11:50:53.000424 0.75
2016-10-05 11:50:53.000982 0.25
2016-10-05 11:50:58.000606 0.75
Time series 2
2016-10-05 11:50:07.000537 0.50
2016-10-05 11:50:11.000994 0.50
2016-10-05 11:50:19.000181 0.50
2016-10-05 11:50:35.000578 0.50
2016-10-05 11:50:46.000761 0.50
2016-10-05 11:50:49.000295 0.75
2016-10-05 11:50:51.000835 0.75
2016-10-05 11:50:55.000792 0.25
2016-10-05 11:50:55.000904 0.75
2016-10-05 11:50:57.000444 0.75
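Assuming each series holds its value until the next change, what is the most efficient way to determine the percentage of time that they have the same value?
Example
Let's calculate the time these series overlap, from 11:50:07.000537 to 11:50:57.000444, since that is the period for which we have data for both series. Times of overlap:
- 11:50:10.000479 - 11:50:15.000234 (both have a value of 0.5) 4.999755 seconds
- 11:50:37.000199 - 11:50:49.000295 (both have a value of 0.5) 12.000096 seconds
- 11:50:53.000424 - 11:50:53.000982 (both have a value of 0.75) 0.000558 seconds
- 11:50:55.000792 - 11:50:55.000904 (both have a value of 0.25) 0.000112 seconds
Result: (4.999755 + 12.000096 + 0.000558 + 0.000112) / 49.999907 = 34%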
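As a quick sanity check of the arithmetic above, here is a minimal sketch that just adds up the four overlap durations and divides by the length of the shared window. The variable names are only illustrative, and the numbers are copied from the list rather than computed from the series.

```python
# Overlap durations in seconds, copied from the list above
overlaps = [4.999755, 12.000096, 0.000558, 0.000112]

# Length in seconds of the window covered by both series:
# 11:50:07.000537 to 11:50:57.000444
window = 49.999907

fraction = sum(overlaps) / window
print(f"{fraction:.2%}")  # 34.00%
```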
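One of the issues is that my actual time series have much more data, such as 1000 - 10000 observations, and I need to run many more pairs. I thought about forward filling a series, then simply comparing the rows and dividing the total number of matches by the total number of rows, but I do not think this would be very efficient.
Cool problem. I brute forced this without any pandas or numpy operations (pandas is only used to hold the data), but I got your answer (thanks for working it out). I haven't tested it on anything else. I also don't know how fast it is, since it only goes through each dataframe once but doesn't do any vectorization.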
import pandas as pd
#############################################################################
#Preparing the dataframes
times_1 = ["2016-10-05 11:50:02.000734","2016-10-05 11:50:03.000033",
"2016-10-05 11:50:10.000479","2016-10-05 11:50:15.000234",
"2016-10-05 11:50:37.000199","2016-10-05 11:50:49.000401",
"2016-10-05 11:50:51.000362","2016-10-05 11:50:53.000424",
"2016-10-05 11:50:53.000982","2016-10-05 11:50:58.000606"]
times_1 = [pd.Timestamp(t) for t in times_1]
vals_1 = [0.50,0.25,0.50,0.25,0.50,0.50,0.25,0.75,0.25,0.75]
times_2 = ["2016-10-05 11:50:07.000537","2016-10-05 11:50:11.000994",
"2016-10-05 11:50:19.000181","2016-10-05 11:50:35.000578",
"2016-10-05 11:50:46.000761","2016-10-05 11:50:49.000295",
"2016-10-05 11:50:51.000835","2016-10-05 11:50:55.000792",
"2016-10-05 11:50:55.000904","2016-10-05 11:50:57.000444"]
times_2 = [pd.Timestamp(t) for t in times_2]
vals_2 = [0.50,0.50,0.50,0.50,0.50,0.75,0.75,0.25,0.75,0.75]
data_1 = pd.DataFrame({"time":times_1,"vals":vals_1})
data_2 = pd.DataFrame({"time":times_2,"vals":vals_2})
#############################################################################
shared_time = 0 #Keep running tally of shared time
t1_ind = 0 #Pointer to row in data_1 dataframe
t2_ind = 0 #Pointer to row in data_2 dataframe
#Loop through both dataframes once, incrementing either the t1 or t2 index
#Stop one before the end of both since do +1 indexing in loop
while t1_ind < len(data_1.time)-1 and t2_ind < len(data_2.time)-1:
    #Get val1 and val2
    val1,val2 = data_1.vals[t1_ind], data_2.vals[t2_ind]
    #Get the start and stop of the current time window
    t1_start,t1_stop = data_1.time[t1_ind], data_1.time[t1_ind+1]
    t2_start,t2_stop = data_2.time[t2_ind], data_2.time[t2_ind+1]
    #Overlap of the two current time windows (may be empty)
    overlap_start = max(t1_start,t2_start)
    overlap_stop = min(t1_stop,t2_stop)
    #Count the overlap only when both series hold the same value
    if val1 == val2 and overlap_start < overlap_stop:
        shared_time += (overlap_stop-overlap_start).total_seconds()
    #Advance whichever window ends first, so that no overlap with the
    #other (longer) window is skipped
    if t1_stop <= t2_stop:
        t1_ind += 1
    else:
        t2_ind += 1
#How I calculated the maximum possible shared time (not pretty)
shared_start = max(data_1.time[0],data_2.time[0])
shared_stop = min(data_1.time.iloc[-1],data_2.time.iloc[-1])
max_possible_shared = (shared_stop-shared_start).total_seconds()
#Print output (fixed number of decimals so float noise is not shown)
print("Shared time: %.6f" % shared_time)
print("Total possible shared: %.6f" % max_possible_shared)
print("Percent shared: %.10f %%" % (shared_time*100/max_possible_shared))
Output:
Shared time: 17.000521
Total possible shared: 49.999907
Percent shared: 34.0011052421 %
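Since the question mentions running many pairs, the same sweep can also be written without a Python-level loop. The following is only a sketch under stated assumptions, not part of the answer above: `shared_fraction` is a hypothetical helper name, timestamps are passed as sorted NumPy arrays of seconds (here, seconds after 11:50:00), and each series is assumed to hold its value until its next timestamp.

```python
import numpy as np

def shared_fraction(t1, v1, t2, v2):
    """Fraction of the common window in which both series hold the same value.

    t1, t2 are sorted 1-D arrays of timestamps; v1, v2 are the values that
    start at those timestamps (hypothetical helper, for illustration only).
    """
    # Window for which both series have data
    start = max(t1[0], t2[0])
    stop = min(t1[-1], t2[-1])
    # Every change point of either series inside that window
    edges = np.union1d(t1, t2)
    edges = edges[(edges >= start) & (edges <= stop)]
    # Index of the value each series holds at the left edge of each segment
    i1 = np.searchsorted(t1, edges[:-1], side="right") - 1
    i2 = np.searchsorted(t2, edges[:-1], side="right") - 1
    same = v1[i1] == v2[i2]
    durations = np.diff(edges)
    return durations[same].sum() / (stop - start)

# The example data, with timestamps as seconds after 11:50:00
t1 = np.array([2.000734, 3.000033, 10.000479, 15.000234, 37.000199,
               49.000401, 51.000362, 53.000424, 53.000982, 58.000606])
v1 = np.array([0.50, 0.25, 0.50, 0.25, 0.50, 0.50, 0.25, 0.75, 0.25, 0.75])
t2 = np.array([7.000537, 11.000994, 19.000181, 35.000578, 46.000761,
               49.000295, 51.000835, 55.000792, 55.000904, 57.000444])
v2 = np.array([0.50, 0.50, 0.50, 0.50, 0.50, 0.75, 0.75, 0.25, 0.75, 0.75])

frac = shared_fraction(t1, v1, t2, v2)
print(round(frac, 6))  # 0.340011
```

The design choice here is to cut the common window at every change point of either series, so that within each segment both values are constant and a single comparison per segment is enough.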
Setup
Create 2 time series
from io import StringIO
import pandas as pd
#Timestamp and value are separated by two spaces, matching sep below
txt1 = """2016-10-05 11:50:02.000734  0.50
2016-10-05 11:50:03.000033  0.25
2016-10-05 11:50:10.000479  0.50
2016-10-05 11:50:15.000234  0.25
2016-10-05 11:50:37.000199  0.50
2016-10-05 11:50:49.000401  0.50
2016-10-05 11:50:51.000362  0.25
2016-10-05 11:50:53.000424  0.75
2016-10-05 11:50:53.000982  0.25
2016-10-05 11:50:58.000606  0.75"""
#squeeze('columns') turns the one-column frame into a Series
s1 = pd.read_csv(StringIO(txt1), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None
                 ).squeeze('columns').rename('s1').rename_axis(None)
txt2 = """2016-10-05 11:50:07.000537  0.50
2016-10-05 11:50:11.000994  0.50
2016-10-05 11:50:19.000181  0.50
2016-10-05 11:50:35.000578  0.50
2016-10-05 11:50:46.000761  0.50
2016-10-05 11:50:49.000295  0.75
2016-10-05 11:50:51.000835  0.75
2016-10-05 11:50:55.000792  0.25
2016-10-05 11:50:55.000904  0.75
2016-10-05 11:50:57.000444  0.75"""
s2 = pd.read_csv(StringIO(txt2), sep=r'\s{2,}', engine='python',
                 parse_dates=[0], index_col=0, header=None
                 ).squeeze('columns').rename('s2').rename_axis(None)
TL;DR
df = pd.concat([s1, s2], axis=1).ffill().dropna()
overlap = df.index.to_series().diff().shift(-1) \
    .fillna(pd.Timedelta(0)).groupby(df.s1.eq(df.s2)).sum()
overlap.div(overlap.sum())
False    0.666657
True     0.333343
dtype: float64
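Explanation
Build the base pd.DataFrame df
- use pd.concat to align the indexes
- use ffill to let the values propagate forward
- use dropna to drop the values of one series from before the other one starts
df = pd.concat([s1, s2], axis=1).ffill().dropna()
df
Calculate 'duration'
from the current timestamp to the next one
df['duration'] = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
df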
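To make the 'duration' step easier to follow, here is a small sketch on a made-up three-point index (the timestamps are invented for illustration). `diff()` gives the time since the previous row, `shift(-1)` moves it up so each row carries the time until the next row, and the last row, which has no successor, is filled with a zero timedelta.

```python
import pandas as pd

# Hypothetical index with gaps of 3 and 7 seconds
idx = pd.to_datetime(["2016-10-05 11:50:00",
                      "2016-10-05 11:50:03",
                      "2016-10-05 11:50:10"])

# Time from each timestamp to the next one; zero for the last row
duration = idx.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
seconds = duration.dt.total_seconds().tolist()
print(seconds)  # [3.0, 7.0, 0.0]
```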
Calculate the overlap
- df.s1.eq(df.s2) gives a boolean series of when s1 overlaps with s2
- use groupby on the boolean series above to aggregate the total duration when True and False
overlap = df.groupby(df.s1.eq(df.s2)).duration.sum()
overlap
False   00:00:33.999548
True    00:00:17.000521
Name: duration, dtype: timedelta64[ns]
Percentage of time with the same values
overlap.div(overlap.sum())
False    0.666657
True     0.333343
Name: duration, dtype: float64
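Note that the two durations above add up to 51.000069 seconds, because the frame runs until the last timestamp of s1 (11:50:58.000606), whereas the question measures against the 49.999907 seconds for which both series have data. A possible way to reproduce the 34% figure is to cut the frame at the earlier of the two last timestamps. This is a sketch rather than part of the answer above: `same_value_fraction` is a hypothetical helper name, the series are rebuilt inline so the block runs on its own, and the explicit `sort_index()` is a precaution instead of relying on the ordering `pd.concat` returns.

```python
import pandas as pd

s1 = pd.Series(
    [0.50, 0.25, 0.50, 0.25, 0.50, 0.50, 0.25, 0.75, 0.25, 0.75],
    index=pd.to_datetime([
        "2016-10-05 11:50:02.000734", "2016-10-05 11:50:03.000033",
        "2016-10-05 11:50:10.000479", "2016-10-05 11:50:15.000234",
        "2016-10-05 11:50:37.000199", "2016-10-05 11:50:49.000401",
        "2016-10-05 11:50:51.000362", "2016-10-05 11:50:53.000424",
        "2016-10-05 11:50:53.000982", "2016-10-05 11:50:58.000606"]),
    name="s1")
s2 = pd.Series(
    [0.50, 0.50, 0.50, 0.50, 0.50, 0.75, 0.75, 0.25, 0.75, 0.75],
    index=pd.to_datetime([
        "2016-10-05 11:50:07.000537", "2016-10-05 11:50:11.000994",
        "2016-10-05 11:50:19.000181", "2016-10-05 11:50:35.000578",
        "2016-10-05 11:50:46.000761", "2016-10-05 11:50:49.000295",
        "2016-10-05 11:50:51.000835", "2016-10-05 11:50:55.000792",
        "2016-10-05 11:50:55.000904", "2016-10-05 11:50:57.000444"]),
    name="s2")

def same_value_fraction(a, b):
    """Share of the common window in which a and b hold the same value."""
    # Align on the union of timestamps and carry each value forward
    df = pd.concat([a.rename("a"), b.rename("b")], axis=1).sort_index()
    df = df.ffill().dropna()
    # Keep only the window for which both series have data
    df = df.loc[:min(a.index[-1], b.index[-1])]
    # Time from each timestamp to the next one; zero for the last row
    duration = df.index.to_series().diff().shift(-1).fillna(pd.Timedelta(0))
    same = df["a"].eq(df["b"])
    return duration[same].sum() / duration.sum()

frac = same_value_fraction(s1, s2)
print(round(frac, 6))  # 0.340011
```

Wrapping the steps in a function also makes it straightforward to apply the same calculation to many pairs of series.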