Join a time-series dataframe with an "interval" dataframe
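I am struggling to join data from an interval dataframe onto a time-series dataframe.
For each row of my time series, I want to find which interval contains it and return a specific value from the interval dataframe.
I was inspired by this solution:
But as far as I can tell it does not work here, for reasons that are beyond me.
Here is my error message: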
KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13072/1034504056.py in <module>
1 #df_test.index = pd.IntervalIndex.from_arrays(df_test['Start'],df_test['End'],closed='both')
----> 2 data_test['Product'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Product'])
~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwargs)
4355 dtype: float64
4356 """
-> 4357 return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
4358
4359 def _reduce(
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply(self)
1041 return self.apply_str()
1042
-> 1043 return self.apply_standard()
1044
1045 def agg(self):
~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
1097 # List[Union[Callable[..., Any], str]]]]]"; expected
1098 # "Callable[[Any], Any]"
-> 1099 mapped = lib.map_infer(
1100 values,
1101 f, # type: ignore[arg-type]
~\Anaconda3\lib\site-packages\pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
~\AppData\Local\Temp/ipykernel_13072/1034504056.py in <lambda>(x)
1 #df_test.index = pd.IntervalIndex.from_arrays(df_test['Heure début réelle'],df_test['Hre fin réelle'],closed='both')
----> 2 data_test['Designation'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Désignation article'])
~\Anaconda3\lib\site-packages\pandas\core\indexes\interval.py in get_loc(self, key, method, tolerance)
631 matches = mask.sum()
632 if matches == 0:
--> 633 raise KeyError(key)
634 elif matches == 1:
635 return mask.argmax()
KeyError: Timestamp('2021-10-23 23:59:29')
The code I am trying to get working:
df_test.index = pd.IntervalIndex.from_arrays(df_test['Start'],df_test['End'],closed='both')
data_test['Product'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Product'])
Sample values of data_test:
{'Ordre': {92: 3149484,
93: 3149484,
94: 3149484,
95: 3149610,
96: 3149610,
97: 3149610,
98: 3149610,
99: 3149610,
100: 3149610,
101: 3149610,
102: 3149611},
'Start': {92: Timestamp('2021-10-26 06:55:00'),
93: Timestamp('2021-10-26 06:55:00'),
94: Timestamp('2021-10-26 06:55:00'),
95: Timestamp('2021-10-26 07:25:00'),
96: Timestamp('2021-10-26 07:25:00'),
97: Timestamp('2021-10-26 07:25:00'),
98: Timestamp('2021-10-26 08:30:00'),
99: Timestamp('2021-10-26 08:30:00'),
100: Timestamp('2021-10-26 08:30:00'),
101: Timestamp('2021-10-26 08:30:00'),
102: Timestamp('2021-10-26 11:37:00')},
'End': {92: Timestamp('2021-10-26 07:25:00'),
93: Timestamp('2021-10-26 07:25:00'),
94: Timestamp('2021-10-26 07:25:00'),
95: Timestamp('2021-10-26 08:30:00'),
96: Timestamp('2021-10-26 08:30:00'),
97: Timestamp('2021-10-26 08:30:00'),
98: Timestamp('2021-10-26 11:37:00'),
99: Timestamp('2021-10-26 11:37:00'),
100: Timestamp('2021-10-26 11:37:00'),
101: Timestamp('2021-10-26 11:37:00'),
102: Timestamp('2021-10-26 12:30:00')},
'Product': {92: 'Product_1',
93: 'Product_1',
94: 'Product_1',
95: 'Product_2',
96: 'Product_2',
97: 'Product_2',
98: 'Product_2',
99: 'Product_2',
100: 'Product_2',
101: 'Product_2',
102: 'Product_2'}}
Sample values of df_test:
{'Temperature_1': {Timestamp('2021-10-26 06:55:29'): 62.9905242919922,
Timestamp('2021-10-26 06:56:29'): 62.9905242919922,
Timestamp('2021-10-26 06:57:29'): 62.9905242919922,
Timestamp('2021-10-26 06:58:29'): 62.9905242919922,
Timestamp('2021-10-26 06:59:29'): 62.9905242919922,
Timestamp('2021-10-26 08:25:29'): 65.0611953735352,
Timestamp('2021-10-26 08:26:29'): 65.0611953735352,
Timestamp('2021-10-26 08:27:29'): 65.0611953735352,
Timestamp('2021-10-26 08:28:29'): 65.0611953735352,
Timestamp('2021-10-26 08:29:29'): 65.0611953735352},
'Temperature_2': {Timestamp('2021-10-26 06:55:29'): 66.8290863037109,
Timestamp('2021-10-26 06:56:29'): 66.8290863037109,
Timestamp('2021-10-26 06:57:29'): 66.8290863037109,
Timestamp('2021-10-26 06:58:29'): 66.8290863037109,
Timestamp('2021-10-26 06:59:29'): 66.8290863037109,
Timestamp('2021-10-26 08:25:29'): 67.0449523925781,
Timestamp('2021-10-26 08:26:29'): 67.0449523925781,
Timestamp('2021-10-26 08:27:29'): 67.0449523925781,
Timestamp('2021-10-26 08:28:29'): 66.0404281616211,
Timestamp('2021-10-26 08:29:29'): 66.0404281616211}}
The output would be a new column giving the product whose interval contains each timestamp:
{'Temperature_1': {Timestamp('2021-10-26 06:55:29'): 62.9905242919922,
Timestamp('2021-10-26 06:56:29'): 62.9905242919922,
Timestamp('2021-10-26 06:57:29'): 62.9905242919922,
Timestamp('2021-10-26 06:58:29'): 62.9905242919922,
Timestamp('2021-10-26 06:59:29'): 62.9905242919922,
Timestamp('2021-10-26 08:25:29'): 65.0611953735352,
Timestamp('2021-10-26 08:26:29'): 65.0611953735352,
Timestamp('2021-10-26 08:27:29'): 65.0611953735352,
Timestamp('2021-10-26 08:28:29'): 65.0611953735352,
Timestamp('2021-10-26 08:29:29'): 65.0611953735352},
'Temperature_2': {Timestamp('2021-10-26 06:55:29'): 66.8290863037109,
Timestamp('2021-10-26 06:56:29'): 66.8290863037109,
Timestamp('2021-10-26 06:57:29'): 66.8290863037109,
Timestamp('2021-10-26 06:58:29'): 66.8290863037109,
Timestamp('2021-10-26 06:59:29'): 66.8290863037109,
Timestamp('2021-10-26 08:25:29'): 67.0449523925781,
Timestamp('2021-10-26 08:26:29'): 67.0449523925781,
Timestamp('2021-10-26 08:27:29'): 67.0449523925781,
Timestamp('2021-10-26 08:28:29'): 66.0404281616211,
Timestamp('2021-10-26 08:29:29'): 66.0404281616211},
'Product': {Timestamp('2021-10-26 06:55:29'): 'Product_1',
Timestamp('2021-10-26 06:56:29'): 'Product_1',
Timestamp('2021-10-26 06:57:29'): 'Product_1',
Timestamp('2021-10-26 06:58:29'): 'Product_1',
Timestamp('2021-10-26 06:59:29'): 'Product_1',
Timestamp('2021-10-26 08:25:29'): 'Product_2',
Timestamp('2021-10-26 08:26:29'): 'Product_2',
Timestamp('2021-10-26 08:27:29'): 'Product_2',
Timestamp('2021-10-26 08:28:29'): 'Product_2',
Timestamp('2021-10-26 08:29:29'): 'Product_2'}}
A new set of data:
data_test = {'Ordre': {53: 3147783, 54: 3147783, 55: 3147783, 56: 3147783, 57: 3147783},
'Start': {53: Timestamp('2021-10-24 20:35:00'),
54: Timestamp('2021-10-24 20:35:00'),
55: Timestamp('2021-10-25 00:01:00'),
56: Timestamp('2021-10-25 00:01:00'),
57: Timestamp('2021-10-25 00:01:00')},
'End': {53: Timestamp('2021-10-24 23:59:00'),
54: Timestamp('2021-10-24 23:59:00'),
55: Timestamp('2021-10-25 04:27:00'),
56: Timestamp('2021-10-25 04:27:00'),
57: Timestamp('2021-10-25 04:27:00')},
'Product': {53: 'Product_1',
54: 'Product_1',
55: 'Product_1',
56: 'Product_1',
57: 'Product_1'}}
df_test = {'Temperature_1': {Timestamp('2021-10-24 23:55:00'): 48.0,
Timestamp('2021-10-24 23:56:00'): 48.0,
Timestamp('2021-10-24 23:57:00'): 48.0,
Timestamp('2021-10-24 23:58:00'): 48.0,
Timestamp('2021-10-24 23:59:00'): 48.0,
Timestamp('2021-10-25 00:00:00'): 48.0,
Timestamp('2021-10-25 00:01:00'): 48.0,
Timestamp('2021-10-25 00:02:00'): 48.0},
'Temperature_2': {Timestamp('2021-10-24 23:55:00'): 60.0,
Timestamp('2021-10-24 23:56:00'): 60.0,
Timestamp('2021-10-24 23:57:00'): 60.0,
Timestamp('2021-10-24 23:58:00'): 60.0,
Timestamp('2021-10-24 23:59:00'): 60.0,
Timestamp('2021-10-25 00:00:00'): 59.0,
Timestamp('2021-10-25 00:01:00'): 59.0,
Timestamp('2021-10-25 00:02:00'): 59.0}}
Thanks for your help and advice.
The intervals should be created on data_test, not df_test. Also, your data_test has duplicates:
data_test = data_test.drop_duplicates()
data_test.index = pd.IntervalIndex.from_arrays(data_test['Start'],
data_test['End'],
closed='both')
product = (df_test
.index
.to_series()
.apply(lambda df: data_test.iloc[data_test.index.get_loc(df),
data_test.columns.get_loc('Product')])
)
df_test.assign(Product = product)
Temperature_1 Temperature_2 Product
2021-10-26 06:55:29 62.990524 66.829086 Product_1
2021-10-26 06:56:29 62.990524 66.829086 Product_1
2021-10-26 06:57:29 62.990524 66.829086 Product_1
2021-10-26 06:58:29 62.990524 66.829086 Product_1
2021-10-26 06:59:29 62.990524 66.829086 Product_1
2021-10-26 08:25:29 65.061195 67.044952 Product_2
2021-10-26 08:26:29 65.061195 67.044952 Product_2
2021-10-26 08:27:29 65.061195 67.044952 Product_2
2021-10-26 08:28:29 65.061195 66.040428 Product_2
2021-10-26 08:29:29 65.061195 66.040428 Product_2
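To see why the duplicates matter, and why a timestamp outside every interval produces the KeyError from the question, here is a minimal sketch. The intervals and timestamps are made up for illustration and are not the question's data. It shows the three things `IntervalIndex.get_loc` can do: return a single position, raise `KeyError`, or return something that is not a single position when several intervals contain the key.

```python
import numpy as np
import pandas as pd

# Hypothetical intervals, chosen only for illustration.
starts = pd.to_datetime(['2021-10-26 06:55', '2021-10-26 08:30'])
ends = pd.to_datetime(['2021-10-26 07:25', '2021-10-26 11:37'])
unique_idx = pd.IntervalIndex.from_arrays(starts, ends, closed='both')

inside = pd.Timestamp('2021-10-26 07:00')   # contained in the first interval
outside = pd.Timestamp('2021-10-26 08:00')  # contained in no interval

# Exactly one interval contains the key: get_loc returns its integer position.
loc_unique = unique_idx.get_loc(inside)

# No interval contains the key: get_loc raises KeyError.
try:
    unique_idx.get_loc(outside)
    raised = False
except KeyError:
    raised = True

# Duplicated intervals: several intervals contain the key, so the result
# is no longer a single integer and .iloc[...] would select several rows.
dup_idx = pd.IntervalIndex.from_arrays(starts.repeat(2), ends.repeat(2),
                                       closed='both')
loc_dup = dup_idx.get_loc(inside)

print(loc_unique, raised, type(loc_dup))
```

With duplicated intervals the lambda would hand back several rows per timestamp instead of one value, which is why `drop_duplicates` comes first.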
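For the updated data, you are right: if a value does not fall within any interval, it fails. There are other ways to handle this:
One option is conditional_join from pyjanitor, which helps abstract inequality joins: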
# pip install pyjanitor
import pandas as pd
import janitor
data_test = pd.DataFrame(data_test)
df_test = pd.DataFrame(df_test)
df_test.index.name = 'Timestamp'
(df_test
.reset_index()
.conditional_join(
data_test,
('Timestamp', 'Start', '>='),
('Timestamp', 'End', '<='), how = 'left')
.loc[:, ['Timestamp', 'Temperature_1', 'Temperature_2', 'Product']]
.set_index('Timestamp')
)
Temperature_1 Temperature_2 Product
Timestamp
2021-10-24 23:55:00 48.0 60.0 Product_1
2021-10-24 23:55:00 48.0 60.0 Product_1
2021-10-24 23:56:00 48.0 60.0 Product_1
2021-10-24 23:56:00 48.0 60.0 Product_1
2021-10-24 23:57:00 48.0 60.0 Product_1
2021-10-24 23:57:00 48.0 60.0 Product_1
2021-10-24 23:58:00 48.0 60.0 Product_1
2021-10-24 23:58:00 48.0 60.0 Product_1
2021-10-24 23:59:00 48.0 60.0 Product_1
2021-10-24 23:59:00 48.0 60.0 Product_1
2021-10-25 00:00:00 48.0 59.0 NaN
2021-10-25 00:01:00 48.0 59.0 Product_1
2021-10-25 00:01:00 48.0 59.0 Product_1
2021-10-25 00:01:00 48.0 59.0 Product_1
2021-10-25 00:02:00 48.0 59.0 Product_1
2021-10-25 00:02:00 48.0 59.0 Product_1
2021-10-25 00:02:00 48.0 59.0 Product_1
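For readers who prefer not to add a dependency, the same inequality join can be sketched in plain pandas with a cross join followed by a filter. This is only an illustration of the idea, not how pyjanitor implements it. The frames `intervals_df` and `series_df` are hypothetical miniature stand-ins for data_test (already deduplicated) and df_test, and a cross join materialises every timestamp/interval pair, so it only suits small inputs.

```python
import pandas as pd

# Hypothetical miniature stand-ins for data_test and df_test.
intervals_df = pd.DataFrame({
    'Start': pd.to_datetime(['2021-10-24 20:35', '2021-10-25 00:01']),
    'End': pd.to_datetime(['2021-10-24 23:59', '2021-10-25 04:27']),
    'Product': ['Product_1', 'Product_1'],
})
series_df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2021-10-24 23:59',
                                 '2021-10-25 00:00',
                                 '2021-10-25 00:01']),
    'Temperature_1': [48.0, 48.0, 48.0],
})

# Pair every timestamp with every interval (requires pandas >= 1.2).
pairs = series_df.merge(intervals_df, how='cross')

# Keep only the pairs where Start <= Timestamp <= End.
inside = pairs[(pairs['Timestamp'] >= pairs['Start'])
               & (pairs['Timestamp'] <= pairs['End'])]

# Left-merge back so timestamps outside every interval survive with NaN.
result = (series_df
          .merge(inside[['Timestamp', 'Product']], on='Timestamp', how='left')
          .set_index('Timestamp'))
print(result)
```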
Another option involves IntervalIndex; however, instead of apply we use a for loop (apply is a kind of for loop anyway):
# start afresh
data_test = pd.DataFrame(data_test)
df_test = pd.DataFrame(df_test)
# build the intervals
intervals = pd.IntervalIndex.from_arrays(data_test['Start'],
data_test['End'],
closed='both')
data_test.index = intervals
values = {}
# create dictionary of values found in the intervals
for val in df_test.index:
    present = intervals.contains(val)
    if present.any():  # we found something!
        values[val] = intervals[present]
values = pd.Series(values).explode()
# reindex and create a temporary column
df_test.loc[values.index, 'intervals'] = values.array
# use the temporary column to merge
(df_test
.merge(data_test.Product,
left_on='intervals',
right_index = True,
how = 'left')
.drop(columns='intervals')
)
Temperature_1 Temperature_2 Product
2021-10-24 23:55:00 48.0 60.0 Product_1
2021-10-24 23:55:00 48.0 60.0 Product_1
2021-10-24 23:56:00 48.0 60.0 Product_1
2021-10-24 23:56:00 48.0 60.0 Product_1
2021-10-24 23:57:00 48.0 60.0 Product_1
2021-10-24 23:57:00 48.0 60.0 Product_1
2021-10-24 23:58:00 48.0 60.0 Product_1
2021-10-24 23:58:00 48.0 60.0 Product_1
2021-10-24 23:59:00 48.0 60.0 Product_1
2021-10-24 23:59:00 48.0 60.0 Product_1
2021-10-25 00:00:00 48.0 59.0 NaN
2021-10-25 00:01:00 48.0 59.0 Product_1
2021-10-25 00:01:00 48.0 59.0 Product_1
2021-10-25 00:01:00 48.0 59.0 Product_1
2021-10-25 00:02:00 48.0 59.0 Product_1
2021-10-25 00:02:00 48.0 59.0 Product_1
2021-10-25 00:02:00 48.0 59.0 Product_1
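When the intervals are unique and do not overlap, a loop-free variant is possible with `IntervalIndex.get_indexer`, which returns -1 for timestamps that no interval contains instead of raising. The sketch below is hedged accordingly: the frames are hypothetical miniatures, 'Product_2' is invented to make the two intervals distinguishable, and `get_indexer` raises on overlapping intervals (including closed intervals that share an endpoint), so this does not apply to data with duplicated rows.

```python
import pandas as pd

# Hypothetical, deduplicated and non-overlapping intervals.
intervals_df = pd.DataFrame({
    'Start': pd.to_datetime(['2021-10-24 20:35', '2021-10-25 00:01']),
    'End': pd.to_datetime(['2021-10-24 23:59', '2021-10-25 04:27']),
    'Product': ['Product_1', 'Product_2'],  # 'Product_2' is made up
})
series_df = pd.DataFrame(
    {'Temperature_1': [48.0, 48.0, 48.0]},
    index=pd.to_datetime(['2021-10-24 23:59',
                          '2021-10-25 00:00',
                          '2021-10-25 00:01']),
)

idx = pd.IntervalIndex.from_arrays(intervals_df['Start'],
                                   intervals_df['End'],
                                   closed='both')

# Position of the containing interval for each timestamp, -1 if none.
pos = idx.get_indexer(series_df.index)

# Look up Product by position; -1 is not a valid label, so it becomes NaN.
matched = intervals_df['Product'].reset_index(drop=True).reindex(pos)
series_df['Product'] = matched.to_numpy()
print(series_df)
```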