Join a time-series dataframe with an "interval" dataframe

Question

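I am struggling to join data from an interval dataframe onto a time-series dataframe. For each row of my time series, I want to find which interval contains it and return a specific value from the interval dataframe.

I was inspired by this solution:

But it does not work, for reasons that are beyond my understanding.

Here is my error message: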

KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13072/1034504056.py in <module>
      1 #df_test.index = pd.IntervalIndex.from_arrays(df_test['Start'],df_test['End'],closed='both')
----> 2 data_test['Product'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Product'])

~\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwargs)
   4355         dtype: float64
   4356         """
-> 4357         return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
   4358 
   4359     def _reduce(

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply(self)
   1041             return self.apply_str()
   1042 
-> 1043         return self.apply_standard()
   1044 
   1045     def agg(self):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
   1097                 # List[Union[Callable[..., Any], str]]]]]"; expected
   1098                 # "Callable[[Any], Any]"
-> 1099                 mapped = lib.map_infer(
   1100                     values,
   1101                     f,  # type: ignore[arg-type]

~\Anaconda3\lib\site-packages\pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

~\AppData\Local\Temp/ipykernel_13072/1034504056.py in <lambda>(x)
      1 #df_test.index = pd.IntervalIndex.from_arrays(df_test['Heure début réelle'],df_test['Hre fin réelle'],closed='both')
----> 2 data_test['Designation'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Désignation article'])

~\Anaconda3\lib\site-packages\pandas\core\indexes\interval.py in get_loc(self, key, method, tolerance)
    631         matches = mask.sum()
    632         if matches == 0:
--> 633             raise KeyError(key)
    634         elif matches == 1:
    635             return mask.argmax()

KeyError: Timestamp('2021-10-23 23:59:29')

This is the code I am trying to get working:

df_test.index = pd.IntervalIndex.from_arrays(df_test['Start'],df_test['End'],closed='both')
data_test['Product'] = data_test.index.to_series().apply(lambda x : df_test.iloc[df_test.index.get_loc(x)]['Product'])

Sample values of data_test:

{'Ordre': {92: 3149484,
  93: 3149484,
  94: 3149484,
  95: 3149610,
  96: 3149610,
  97: 3149610,
  98: 3149610,
  99: 3149610,
  100: 3149610,
  101: 3149610,
  102: 3149611},
 'Start': {92: Timestamp('2021-10-26 06:55:00'),
  93: Timestamp('2021-10-26 06:55:00'),
  94: Timestamp('2021-10-26 06:55:00'),
  95: Timestamp('2021-10-26 07:25:00'),
  96: Timestamp('2021-10-26 07:25:00'),
  97: Timestamp('2021-10-26 07:25:00'),
  98: Timestamp('2021-10-26 08:30:00'),
  99: Timestamp('2021-10-26 08:30:00'),
  100: Timestamp('2021-10-26 08:30:00'),
  101: Timestamp('2021-10-26 08:30:00'),
  102: Timestamp('2021-10-26 11:37:00')},
 'End': {92: Timestamp('2021-10-26 07:25:00'),
  93: Timestamp('2021-10-26 07:25:00'),
  94: Timestamp('2021-10-26 07:25:00'),
  95: Timestamp('2021-10-26 08:30:00'),
  96: Timestamp('2021-10-26 08:30:00'),
  97: Timestamp('2021-10-26 08:30:00'),
  98: Timestamp('2021-10-26 11:37:00'),
  99: Timestamp('2021-10-26 11:37:00'),
  100: Timestamp('2021-10-26 11:37:00'),
  101: Timestamp('2021-10-26 11:37:00'),
  102: Timestamp('2021-10-26 12:30:00')},
 'Product': {92: 'Product_1',
  93: 'Product_1',
  94: 'Product_1',
  95: 'Product_2',
  96: 'Product_2',
  97: 'Product_2',
  98: 'Product_2',
  99: 'Product_2',
  100: 'Product_2',
  101: 'Product_2',
  102: 'Product_2'}}

Sample values of df_test:

{'Temperature_1': {Timestamp('2021-10-26 06:55:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:56:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:57:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:58:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:59:29'): 62.9905242919922,
  Timestamp('2021-10-26 08:25:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:26:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:27:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:28:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:29:29'): 65.0611953735352},
 'Temperature_2': {Timestamp('2021-10-26 06:55:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:56:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:57:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:58:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:59:29'): 66.8290863037109,
  Timestamp('2021-10-26 08:25:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:26:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:27:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:28:29'): 66.0404281616211,
  Timestamp('2021-10-26 08:29:29'): 66.0404281616211}}

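The output would be a new column indicating which product applies to each timestamp, provided the timestamp falls within one of the intervals: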

{'Temperature_1': {Timestamp('2021-10-26 06:55:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:56:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:57:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:58:29'): 62.9905242919922,
  Timestamp('2021-10-26 06:59:29'): 62.9905242919922,
  Timestamp('2021-10-26 08:25:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:26:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:27:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:28:29'): 65.0611953735352,
  Timestamp('2021-10-26 08:29:29'): 65.0611953735352},
 'Temperature_2': {Timestamp('2021-10-26 06:55:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:56:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:57:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:58:29'): 66.8290863037109,
  Timestamp('2021-10-26 06:59:29'): 66.8290863037109,
  Timestamp('2021-10-26 08:25:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:26:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:27:29'): 67.0449523925781,
  Timestamp('2021-10-26 08:28:29'): 66.0404281616211,
  Timestamp('2021-10-26 08:29:29'): 66.0404281616211},
'Product': {Timestamp('2021-10-26 06:55:29'): 'Product_1',
  Timestamp('2021-10-26 06:56:29'): 'Product_1',
  Timestamp('2021-10-26 06:57:29'): 'Product_1',
  Timestamp('2021-10-26 06:58:29'): 'Product_1',
  Timestamp('2021-10-26 06:59:29'): 'Product_1',
  Timestamp('2021-10-26 08:25:29'): 'Product_2',
  Timestamp('2021-10-26 08:26:29'): 'Product_2',
  Timestamp('2021-10-26 08:27:29'): 'Product_2',
  Timestamp('2021-10-26 08:28:29'): 'Product_2',
  Timestamp('2021-10-26 08:29:29'): 'Product_2'}}

A new set of data:

data_test = {'Ordre': {53: 3147783, 54: 3147783, 55: 3147783, 56: 3147783, 57: 3147783},
 'Start': {53: Timestamp('2021-10-24 20:35:00'),
  54: Timestamp('2021-10-24 20:35:00'),
  55: Timestamp('2021-10-25 00:01:00'),
  56: Timestamp('2021-10-25 00:01:00'),
  57: Timestamp('2021-10-25 00:01:00')},
 'End': {53: Timestamp('2021-10-24 23:59:00'),
  54: Timestamp('2021-10-24 23:59:00'),
  55: Timestamp('2021-10-25 04:27:00'),
  56: Timestamp('2021-10-25 04:27:00'),
  57: Timestamp('2021-10-25 04:27:00')},
 'Product': {53: 'Product_1',
  54: 'Product_1',
  55: 'Product_1',
  56: 'Product_1',
  57: 'Product_1'}}

df_test = {'Temperature_1': {Timestamp('2021-10-24 23:55:00'): 48.0,
  Timestamp('2021-10-24 23:56:00'): 48.0,
  Timestamp('2021-10-24 23:57:00'): 48.0,
  Timestamp('2021-10-24 23:58:00'): 48.0,
  Timestamp('2021-10-24 23:59:00'): 48.0,
  Timestamp('2021-10-25 00:00:00'): 48.0,
  Timestamp('2021-10-25 00:01:00'): 48.0,
  Timestamp('2021-10-25 00:02:00'): 48.0},
 'Temperature_2': {Timestamp('2021-10-24 23:55:00'): 60.0,
  Timestamp('2021-10-24 23:56:00'): 60.0,
  Timestamp('2021-10-24 23:57:00'): 60.0,
  Timestamp('2021-10-24 23:58:00'): 60.0,
  Timestamp('2021-10-24 23:59:00'): 60.0,
  Timestamp('2021-10-25 00:00:00'): 59.0,
  Timestamp('2021-10-25 00:01:00'): 59.0,
  Timestamp('2021-10-25 00:02:00'): 59.0}}

Thanks for your help and advice.

Answer 1

The intervals should be created on data_test, not on df_test. Also, your data_test has duplicates:

data_test = data_test.drop_duplicates()
data_test.index = pd.IntervalIndex.from_arrays(data_test['Start'],
                                               data_test['End'],
                                               closed='both')

# for each timestamp, find the position of the interval that contains it
product = (df_test
           .index
           .to_series()
           .apply(lambda ts: data_test.iloc[data_test.index.get_loc(ts),
                                            data_test.columns.get_loc('Product')])
          )

df_test.assign(Product = product)
 
                     Temperature_1  Temperature_2    Product
2021-10-26 06:55:29      62.990524      66.829086  Product_1
2021-10-26 06:56:29      62.990524      66.829086  Product_1
2021-10-26 06:57:29      62.990524      66.829086  Product_1
2021-10-26 06:58:29      62.990524      66.829086  Product_1
2021-10-26 06:59:29      62.990524      66.829086  Product_1
2021-10-26 08:25:29      65.061195      67.044952  Product_2
2021-10-26 08:26:29      65.061195      67.044952  Product_2
2021-10-26 08:27:29      65.061195      67.044952  Product_2
2021-10-26 08:28:29      65.061195      66.040428  Product_2
2021-10-26 08:29:29      65.061195      66.040428  Product_2
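
To illustrate why the lookup raises a KeyError, here is a minimal sketch using two of the intervals from the sample data. The names intervals, products, inside and outside are illustrative and not part of the original post. It shows that IntervalIndex.get_loc returns an integer position when exactly one interval contains the timestamp, and raises KeyError when no interval does.

```python
import pandas as pd

# Two intervals taken from the sample data (duplicates already removed).
intervals = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2021-10-26 06:55", "2021-10-26 07:25"]),
    pd.to_datetime(["2021-10-26 07:25", "2021-10-26 08:30"]),
    closed="both",
)
products = pd.Series(["Product_1", "Product_2"], index=intervals)

inside = pd.Timestamp("2021-10-26 06:55:29")   # contained in the first interval
outside = pd.Timestamp("2021-10-23 23:59:29")  # contained in no interval

# Exactly one interval matches, so get_loc returns an integer position.
pos = intervals.get_loc(inside)
found = products.iloc[pos]

# No interval matches, so get_loc raises KeyError, as in the traceback.
try:
    intervals.get_loc(outside)
    raised = False
except KeyError:
    raised = True
```

Any timestamp that lies before the first Start, after the last End, or in a gap between intervals will hit the KeyError branch, so an apply built on get_loc only works when every timestamp is covered.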

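For the updated data, you are right: it fails if a value does not fall within any interval. There are other ways to solve this.

One option is to use conditional_join from pyjanitor, which helps abstract inequality joins: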

# pip install pyjanitor
import pandas as pd
import janitor

data_test = pd.DataFrame(data_test)

df_test = pd.DataFrame(df_test)


df_test.index.name = 'Timestamp'

(df_test
  .reset_index()
  .conditional_join(
       data_test, 
       ('Timestamp', 'Start', '>='), 
       ('Timestamp', 'End', '<='), how = 'left')
  .loc[:, ['Timestamp', 'Temperature_1', 'Temperature_2', 'Product']]
  .set_index('Timestamp')
) 
                     Temperature_1  Temperature_2    Product
Timestamp                                                   
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-25 00:00:00           48.0           59.0        NaN
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
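
If installing pyjanitor is not an option, the same inequality join can be approximated in plain pandas. The following is a sketch under assumptions, not the method from the answer above: it builds every (timestamp, interval) pair with a cross merge, keeps the pairs where Start <= Timestamp <= End, and left-merges the result back so that unmatched timestamps keep a NaN. The names intervals, readings, pairs, matched and result are illustrative, and the data is a shortened, de-duplicated version of the updated sample.

```python
import pandas as pd

# Shortened, de-duplicated version of the updated sample data.
intervals = pd.DataFrame({
    "Start": pd.to_datetime(["2021-10-24 20:35", "2021-10-25 00:01"]),
    "End": pd.to_datetime(["2021-10-24 23:59", "2021-10-25 04:27"]),
    "Product": ["Product_1", "Product_1"],
})
readings = pd.DataFrame(
    {"Temperature_1": [48.0, 48.0, 48.0]},
    index=pd.to_datetime(
        ["2021-10-24 23:59", "2021-10-25 00:00", "2021-10-25 00:01"]
    ),
)
readings.index.name = "Timestamp"

# Every timestamp paired with every interval (requires pandas >= 1.2).
pairs = readings.reset_index().merge(intervals, how="cross")

# Keep only pairs where the timestamp lies inside the closed interval.
matched = pairs[pairs["Timestamp"].between(pairs["Start"], pairs["End"])]

# Left-merge back so timestamps without a matching interval get NaN.
result = (
    readings.reset_index()
    .merge(matched[["Timestamp", "Product"]], on="Timestamp", how="left")
    .set_index("Timestamp")
)
```

The cross merge creates len(readings) * len(intervals) rows before filtering, so this is only reasonable for small inputs; for larger data a dedicated inequality join such as conditional_join is likely to scale better.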

Another option involves IntervalIndex; however, instead of apply, we use a for loop (apply is itself a kind of for loop):

# start afresh

data_test = pd.DataFrame(data_test)

df_test = pd.DataFrame(df_test)

# build the intervals
intervals = pd.IntervalIndex.from_arrays(data_test['Start'],
                                         data_test['End'],
                                         closed='both')

data_test.index = intervals

values = {}

# create dictionary of values found in the intervals
for val in df_test.index:
    present = intervals.contains(val)
    if present.any(): # we found something!
        values[val] = intervals[present]

values = pd.Series(values).explode()

# reindex and create a temporary column
df_test.loc[values.index, 'intervals'] = values.array

# use the temporary column to merge
(df_test
  .merge(data_test.Product, 
         left_on='intervals', 
         right_index = True, 
         how = 'left')
  .drop(columns='intervals')
)

                     Temperature_1  Temperature_2    Product
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:55:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:56:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:57:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:58:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-24 23:59:00           48.0           60.0  Product_1
2021-10-25 00:00:00           48.0           59.0        NaN
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:01:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
2021-10-25 00:02:00           48.0           59.0  Product_1
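
When the intervals are unique and do not overlap (for example after dropping duplicate rows), a loop-free variant is possible. This is a sketch under that assumption rather than part of the original answer: IntervalIndex.get_indexer returns the position of the containing interval for each timestamp and -1 where there is none, so nothing raises. The names intervals, readings, idx, pos and product are illustrative.

```python
import pandas as pd

# De-duplicated intervals from the updated sample; they do not overlap.
intervals = pd.DataFrame({
    "Start": pd.to_datetime(["2021-10-24 20:35", "2021-10-25 00:01"]),
    "End": pd.to_datetime(["2021-10-24 23:59", "2021-10-25 04:27"]),
    "Product": ["Product_1", "Product_1"],
})
readings = pd.DataFrame(
    {"Temperature_1": [48.0, 48.0, 48.0]},
    index=pd.to_datetime(
        ["2021-10-24 23:59", "2021-10-25 00:00", "2021-10-25 00:01"]
    ),
)

idx = pd.IntervalIndex.from_arrays(
    intervals["Start"], intervals["End"], closed="both"
)

# Position of the containing interval per timestamp, or -1 if there is none.
pos = idx.get_indexer(readings.index)

# Reindexing by position turns the -1 entries into NaN, because -1 is not
# a label in the 0..n-1 RangeIndex.
product = intervals["Product"].reset_index(drop=True).reindex(pos)
readings["Product"] = product.to_numpy()
```

With overlapping or duplicated intervals get_indexer raises an error instead, in which case the loop above or an inequality join remains the safer choice.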
