Pandas series between two date time files
My question is about working with Pandas time series.
I have a file (Spots) containing a month of time-series data at 7.5-second intervals.
Example:
2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4
Another file (Target) contains only time information.
Example:
2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25
I want to find out which spot each target datetime belongs to.
Output for the example above:
2016-11-01 00:00:00,0 '\t' count of targets in this spot = 2
2016-11-01 00:00:07.500000,1 '\t' count of targets in this spot = 0
2016-11-01 00:00:15,2 '\t' count of targets in this spot = 0
2016-11-01 00:00:22.500000,3 '\t' count of targets in this spot = 0
2016-11-01 00:00:30,4 '\t' count of targets in this spot = 2
Thanks in advance. Kindly let me know if this is clear; otherwise I can try to explain further.
Here is my suggestion. First, add another column to the target frame; it will make it possible to identify the targets after the merge:
target['T'] = 1
Concatenate targets and spots, and sort by time:
both = pd.concat([spots,target]).sort_values(0)
# 0 1 T
#0 2016-11-01 00:00:00.000 0.0 NaN
#0 2016-11-01 00:00:05.000 NaN 1.0
#1 2016-11-01 00:00:07.000 NaN 1.0
#1 2016-11-01 00:00:07.500 1.0 NaN
#2 2016-11-01 00:00:15.000 2.0 NaN
#3 2016-11-01 00:00:22.500 3.0 NaN
#2 2016-11-01 00:00:23.000 NaN 1.0
#3 2016-11-01 00:00:25.000 NaN 1.0
#4 2016-11-01 00:00:30.000 4.0 NaN
Forward-fill the spot IDs:
both[1] = both[1].ffill().astype(int)
# 0 1 T
#0 2016-11-01 00:00:00.000 0 NaN
#0 2016-11-01 00:00:05.000 0 1.0
#1 2016-11-01 00:00:07.000 0 1.0
#1 2016-11-01 00:00:07.500 1 NaN
#2 2016-11-01 00:00:15.000 2 NaN
#3 2016-11-01 00:00:22.500 3 NaN
#2 2016-11-01 00:00:23.000 3 1.0
#3 2016-11-01 00:00:25.000 3 1.0
#4 2016-11-01 00:00:30.000 4 NaN
Select the original target rows and columns:
both[both['T']==1][[0,1]]
# 0 1
#0 2016-11-01 00:00:05 0
#1 2016-11-01 00:00:07 0
#2 2016-11-01 00:00:23 3
#3 2016-11-01 00:00:25 3
If you want to count the targets per spot, use groupby():
both.groupby(1).count()['T']
#1
#0 2
#1 0
#2 0
#3 2
#4 0
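For completeness, a minimal end-to-end sketch of this approach, assuming both files are read with header=None so the integer column names 0 and 1 used above apply:

from io import StringIO
import pandas as pd

spots_csv = """2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4"""
target_csv = """2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25"""

# header=None keeps the integer column names (0 and 1) used above
spots = pd.read_csv(StringIO(spots_csv), header=None, parse_dates=[0])
target = pd.read_csv(StringIO(target_csv), header=None, parse_dates=[0])

target['T'] = 1                                   # mark target rows
both = pd.concat([spots, target]).sort_values(0)  # interleave by time
both[1] = both[1].ffill().astype(int)             # forward-fill spot IDs
print(both.groupby(1).count()['T'])               # targets per spot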
Let's use merge_ordered, fillna and groupby:
Input:
df_spots
Date Value
0 2016-11-01 00:00:00.000 0
1 2016-11-01 00:00:07.500 1
2 2016-11-01 00:00:15.000 2
3 2016-11-01 00:00:22.500 3
4 2016-11-01 00:00:30.000 4
df_target
Date
0 2016-11-01 00:00:05
1 2016-11-01 00:00:07
2 2016-11-01 00:00:23
3 2016-11-01 00:00:25
Code:
merged_df = pd.merge_ordered(df_spots, df_target, on = 'Date')
df_out = (merged_df.groupby(by=merged_df['Value']
.fillna(method='ffill'), as_index=False)
.agg({'Date':'first',
'Value':{'first':'first','count':lambda x:len(x)-1}}))
Output:
df_out
Date Value
first first count
0 2016-11-01 00:00:00.000 0.0 2.0
1 2016-11-01 00:00:07.500 1.0 0.0
2 2016-11-01 00:00:15.000 2.0 0.0
3 2016-11-01 00:00:22.500 3.0 2.0
4 2016-11-01 00:00:30.000 4.0 0.0
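Note that each group formed by the forward-filled Value contains the spot row itself plus its target rows, which is why the lambda subtracts one. Also, the nested-dict renaming inside agg() was removed in pandas 0.25+; a hedged equivalent using named aggregation (the Spot and TargetCount names are my own):

import pandas as pd

# rebuild the Input frames shown above
df_spots = pd.DataFrame({
    'Date': pd.to_datetime(['2016-11-01 00:00:00', '2016-11-01 00:00:07.5',
                            '2016-11-01 00:00:15', '2016-11-01 00:00:22.5',
                            '2016-11-01 00:00:30']),
    'Value': range(5),
})
df_target = pd.DataFrame({
    'Date': pd.to_datetime(['2016-11-01 00:00:05', '2016-11-01 00:00:07',
                            '2016-11-01 00:00:23', '2016-11-01 00:00:25']),
})

merged_df = pd.merge_ordered(df_spots, df_target, on='Date')

# each group is one spot row plus its targets, hence len(x) - 1
grp = merged_df.groupby(merged_df['Value'].ffill().rename('Spot'))
df_out = grp.agg(Date=('Date', 'first'),
                 TargetCount=('Value', lambda x: len(x) - 1)).reset_index()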
Use np.searchsorted and pd.value_counts in combination, plus a couple of other things.
idx = Spots.index.to_series()
i = idx.values
t = Target.Date.values
m = pd.value_counts(i[i.searchsorted(t) - 1]).to_dict()
Spots.assign(TargetCount=idx.map(lambda x: m.get(x, 0)))
Value TargetCount
Date
2016-11-01 00:00:00.000 0 2
2016-11-01 00:00:07.500 1 0
2016-11-01 00:00:15.000 2 0
2016-11-01 00:00:22.500 3 2
2016-11-01 00:00:30.000 4 0
How it works

- idx is the index of Spots turned into a pd.Series, because I want to use pd.Series.map later.
- i is the underlying numpy array, which I'll use to perform the searchsorted operation.
- t is the equivalent of i for Target ... the other half of the searchsorted.
- searchsorted runs through each element of the right array and finds the position at which it would have to be inserted into the left array to keep it sorted. That information can be used to find the "bin" each element belongs to. I then subtract one to align with the appropriate index.
- Then I run pd.value_counts to count them.
- map is used to build the new column.
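A tiny worked example of the core searchsorted step, using the example timestamps as plain seconds (illustrative stand-ins only):

import numpy as np

i = np.array([0.0, 7.5, 15.0, 22.5, 30.0])  # spot start times
t = np.array([5.0, 7.0, 23.0, 25.0])        # target times

pos = i.searchsorted(t)  # array([1, 1, 4, 4]): insertion points in i
bins = i[pos - 1]        # array([ 0. ,  0. , 22.5, 22.5]): each target's spot start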
Setup
from io import StringIO
import pandas as pd
tx1 = """2016-11-01 00:00:00,0
2016-11-01 00:00:07.500000,1
2016-11-01 00:00:15,2
2016-11-01 00:00:22.500000,3
2016-11-01 00:00:30,4"""
tx2 = """2016-11-01 00:00:05
2016-11-01 00:00:07
2016-11-01 00:00:23
2016-11-01 00:00:25"""
Spots = pd.read_csv(StringIO(tx1), parse_dates=[0], index_col=0, names=['Date', 'Value'])
Target = pd.read_csv(StringIO(tx2), parse_dates=[0], names=['Date'])
Using pandas merge_asof (note that all time values must be sorted, so you may have to sort them first):
Setup
import numpy as np
import pandas as pd
# make date_range with 1 sec interval (fake targets)
rng = pd.date_range('2016-11-01', periods=100, freq='S')
# resample to make 7.5 sec intervals (fake spot bins)
ts = pd.Series(np.arange(100), index=rng)
ts_vals = ts.resample('7500L').asfreq().index
df_spots = pd.DataFrame({'spot': np.arange(len(ts_vals)), 'bin': ts_vals})
df_spots.head()
bin spot
0 2016-11-01 00:00:00.000 0
1 2016-11-01 00:00:07.500 1
2 2016-11-01 00:00:15.000 2
3 2016-11-01 00:00:22.500 3
4 2016-11-01 00:00:30.000 4
df_targets = pd.DataFrame(rng, columns=['tgt'])
df_targets.head()
tgt
0 2016-11-01 00:00:00
1 2016-11-01 00:00:01
2 2016-11-01 00:00:02
3 2016-11-01 00:00:03
4 2016-11-01 00:00:04
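With real data that may not already be ordered, sort both frames on their join keys first, since merge_asof raises on unsorted input:

# merge_asof requires both join keys sorted ascending
df_targets = df_targets.sort_values('tgt').reset_index(drop=True)
df_spots = df_spots.sort_values('bin').reset_index(drop=True)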
Solution
# this will produce spot membership for targets
df = pd.merge_asof(df_targets, df_spots, left_on='tgt', right_on='bin')
df.head(10)
tgt bin spot
0 2016-11-01 00:00:00 2016-11-01 00:00:00.000 0
1 2016-11-01 00:00:01 2016-11-01 00:00:00.000 0
2 2016-11-01 00:00:02 2016-11-01 00:00:00.000 0
3 2016-11-01 00:00:03 2016-11-01 00:00:00.000 0
4 2016-11-01 00:00:04 2016-11-01 00:00:00.000 0
5 2016-11-01 00:00:05 2016-11-01 00:00:00.000 0
6 2016-11-01 00:00:06 2016-11-01 00:00:00.000 0
7 2016-11-01 00:00:07 2016-11-01 00:00:00.000 0
8 2016-11-01 00:00:08 2016-11-01 00:00:07.500 1
9 2016-11-01 00:00:09 2016-11-01 00:00:07.500 1
# for spot counts...
df_counts = pd.DataFrame(df.groupby('bin')['spot'].count())
df_counts.head()
spot
bin
2016-11-01 00:00:00.000 8
2016-11-01 00:00:07.500 7
2016-11-01 00:00:15.000 8
2016-11-01 00:00:22.500 7
2016-11-01 00:00:30.000 8
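One caveat: groupby().count() only reports bins that received at least one target, so with sparse real data the empty 7.5-second spots would be missing from df_counts. A reindex fills them in (a sketch, assuming the frames above):

# include spots with zero targets (absent from the groupby result)
df_counts = df_counts.reindex(df_spots['bin'], fill_value=0)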