IntervalIndex merging in Pandas through duplicate dates + uniqueID

First question on SO; still learning Python & pandas.

Edit: I've managed to pivot the values DF from long to wide so that there's a unique id+date index (i.e., no uniqueID has more than 1 row per day). However, I still haven't managed to get the result I'm after.

I have a couple of DFs I want to merge, based on A) the uniqueID and B) whether that uniqueID falls between several different, multiple date ranges. I found a couple of links that approach what I'm looking for; however, that solution isn't viable, and after digging a bit it would appear that what I'm attempting isn't possible because the dates overlap (?).

The gist of it is: sum all the values in df_values for a uniqueID if that uniqueID is present in df_dates_range and its corresponding day column falls within the start:end range of dates_range.

There are more columns in each DF, but these are the relevant ones. Mind that there are duplicates all over, in no particular order. All the DF series are properly formatted.

So, here is df1, dates_range:

import pandas as pd
import numpy as np

dates_range = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
               "start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
               "end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
               "df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"]}

df_dates_range = pd.DataFrame(dates_range, 
                              columns = ["uniqueID", 
                                         "start", 
                                         "end", 
                                         "df1_tag1"])

df_dates_range[["start","end"]] = df_dates_range[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)

And df2, the values:

values = {"uniqueID": [1, 2, 7, 3, 4, 4, 10, 1, 8, 7, 10, 9, 10, 8, 3, 10, 11, 3, 7, 4, 10, 14], 
          "df2_tag1": ["abc", "abc", "abc", "abc", "abc", "def", "abc", "abc", "abc", "abc", "abc", "abc", "def", "def", "abc", "abc", "abc", "def", "abc", "abc", "def", "abc"], 
          "df2_tag2": ["type 1", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 2", "type 1", "type 1", "type 1"], 
          "day": ["01/01/2020", "01/02/2020", "01/03/2020", "01/03/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/01/2020", "02/02/2020", "02/03/2020", "02/03/2020", "02/04/2020", "02/05/2020", "02/05/2020", "03/03/2020", "03/04/2020", "03/04/2020", "03/06/2020", "03/06/2020", "03/07/2020", "03/06/2020", "04/08/2020"],
          "df2_value1": [2, 10, 6, 5, 7, 9, 3, 10, 9, 7, 4, 9, 1, 8, 7, 5, 4, 4, 2, 8, 8, 4], 
          "df2_value2": [1, 5, 10, 13, 15, 10, 12, 50, 3, 10, 2, 1, 4, 6, 80, 45, 3, 30, 20, 7.5, 15, 3], 
          "df2_value3": [0.547, 2.160, 0.004, 9.202, 7.518, 1.076, 1.139, 25.375, 0.537, 7.996, 1.475, 0.319, 1.118, 2.927, 7.820, 19.755, 2.529, 2.680, 17.762, 0.814, 1.201, 2.712]}

values["day"] = pd.to_datetime(values["day"], format = "%m/%d/%Y")

df_values = pd.DataFrame(values, 
                         columns = ["uniqueID", 
                                    "df2_tag1", 
                                    "df2_tag2", 
                                    "day", 
                                    "df2_value1", 
                                    "df2_value2", 
                                    "df2_value1"])

Starting from the first link, I tried running the following:

df_dates_range.index = pd.IntervalIndex.from_arrays(df_dates_range["start"], 
                                                        df_dates_range["end"], 
                                                        closed = "both")

df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))

df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])

However, I got the error below. After some n00b checking I tried removing the day index from the second-to-last line, but the issue persists:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-58-54ea384e06f7> in <module>
     14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
     15 
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])

C:\anaconda\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   3846             else:
   3847                 values = self.astype(object).values
-> 3848                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   3849 
   3850         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-58-54ea384e06f7> in <lambda>(x)
     14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
     15 
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_indexer_non_unique(self, target)
   4471     @Appender(_index_shared_docs["get_indexer_non_unique"] % _index_doc_kwargs)
   4472     def get_indexer_non_unique(self, target):
-> 4473         target = ensure_index(target)
   4474         pself, ptarget = self._maybe_promote(target)
   4475         if pself is not self or ptarget is not target:

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in ensure_index(index_like, copy)
   5355             index_like = copy(index_like)
   5356 
-> 5357     return Index(index_like)
   5358 
   5359 

C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
    420             return Index(np.asarray(data), dtype=dtype, copy=copy, name=name, **kwargs)
    421         elif data is None or is_scalar(data):
--> 422             raise cls._scalar_data_error(data)
    423         else:
    424             if tupleize_cols and is_list_like(data):

TypeError: Index(...) must be called with a collection of some kind, Timestamp('2020-01-01 00:00:00') was passed
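
The TypeError happens because Index.get_indexer_non_unique expects a list-like target, while apply hands the lambda one scalar Timestamp at a time. Also, the lambda looks df_values_date_index up against its own DatetimeIndex rather than against the IntervalIndex built on df_dates_range. A minimal sketch of what (I believe) was intended, wrapping the scalar in a list and querying the IntervalIndex instead; it gets past the TypeError, though it still doesn't solve the overlap problem:

# Sketch (my assumption of the intended lookup): get_indexer_non_unique
# returns an (indexer, missing) tuple, hence the [0]. Beware: a day that
# matches no interval yields -1 in the indexer, which iloc would silently
# read as "last row".
matches = df_values_date_index["day"].apply(
    lambda x: df_dates_range.iloc[
        df_dates_range.index.get_indexer_non_unique([x])[0]
    ]
)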

The desired outcome would be:

desired = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
           "start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
           "end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
           "df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"],
           "df2_tag1": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"],
           "df2_value1": [2, 10, 5, 16, 10, 7, 5, np.nan, 11, 8, 2, 8], 
           "df2_value2+df2_value3": [1.547, 7.160, 22.202, 33.595, 75.375, 17.996, 8.594,  np.nan, 120.501, 8.314, 37.762, 16.201], 
           "df2_tag3": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"]}


df_desired = pd.DataFrame(desired, 
                          columns = ["uniqueID", 
                                     "start", 
                                     "end", 
                                     "df1_tag1", 
                                     "df2_tag1", 
                                     "df2_value1", 
                                     "df2_value2+df2_value3", 
                                     "df2_tag3"])

df_desired[["start","end"]] = df_desired[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)

Or, as a graphical visualization:

Note that columns S and T on row 10 are NaN, because uniqueID 11 had no "activity" during the v2 period; however, if possible, I'd like to somehow pull its tags from df2. They are 100% in there, just maybe not within that period: perhaps a task for a second script? Also, note that col T is the aggregation of cols J+K.

Edit: forgot to mention that I had previously tried to do this with @firelynx's solution on another question, but even with 32 GB of RAM my machine couldn't cope. The SQL solution didn't work for me because of some sqlite3 library issues.

In these cases, the easiest thing (if you can afford it hardware-wise) is to create an intermediate DataFrame and aggregate afterwards. This has the huge advantage of separating the merge from the aggregation, which reduces complexity a lot.

In [22]: df = pd.merge(df_dates_range, df_values); df
Out[22]: 
    uniqueID      start        end        day  value1  medium
0          1 2019-12-31 2020-01-04 2020-01-01       1  Online
1          1 2019-12-31 2020-01-04 2020-02-01      50  Online
2          1 2020-02-01 2020-02-05 2020-01-01       1  Online
3          1 2020-02-01 2020-02-05 2020-02-01      50  Online
4          2 2019-12-31 2020-01-04 2020-01-02       5    Shop
..       ...        ...        ...        ...     ...     ...
23        10 2020-02-01 2020-02-05 2020-03-04      45    Shop
24        10 2020-03-03 2020-03-08 2020-01-03      13    Shop
25        10 2020-03-03 2020-03-08 2020-02-03       2  Online
26        10 2020-03-03 2020-03-08 2020-03-04      45    Shop
27        11 2020-02-01 2020-02-05 2020-02-05       4    Shop

In [24]: df = df[(df['day'] > df['start']) & (df['day'] <= df['end'])]; df
Out[24]: 
    uniqueID      start        end        day  value1  medium
0          1 2019-12-31 2020-01-04 2020-01-01       1  Online
4          2 2019-12-31 2020-01-04 2020-01-02       5    Shop
5          3 2019-12-31 2020-01-04 2020-01-04      12    Shop
10         3 2020-03-03 2020-03-08 2020-03-06      30  Online
11         4 2019-12-31 2020-01-04 2020-01-04      15  Online
12         4 2019-12-31 2020-01-04 2020-01-04      10    Shop
16         7 2020-02-01 2020-02-05 2020-02-03      10    Shop
20         7 2020-03-03 2020-03-08 2020-03-06      20    Shop
22        10 2020-02-01 2020-02-05 2020-02-03       2  Online
26        10 2020-03-03 2020-03-08 2020-03-04      45    Shop
27        11 2020-02-01 2020-02-05 2020-02-05       4    Shop

Then you can do something like:
In [30]: df.groupby(['start', 'end', 'uniqueID', 'medium'])['value1'].agg(['count', 'sum']).reset_index()                                                                                                   
Out[30]: 
        start        end  uniqueID  medium  count  sum
0  2019-12-31 2020-01-04         1  Online      1    1
1  2019-12-31 2020-01-04         2    Shop      1    5
2  2019-12-31 2020-01-04         3    Shop      1   12
3  2019-12-31 2020-01-04         4  Online      1   15
4  2019-12-31 2020-01-04         4    Shop      1   10
5  2020-02-01 2020-02-05         7    Shop      1   10
6  2020-02-01 2020-02-05        10  Online      1    2
7  2020-02-01 2020-02-05        11    Shop      1    4
8  2020-03-03 2020-03-08         3  Online      1   30
9  2020-03-03 2020-03-08         7    Shop      1   20
10 2020-03-03 2020-03-08        10    Shop      1   45

to aggregate the data into the desired form. However, I did not get the result you expected: some of the rows with Shop in them and some of the dates are a bit off. I blame the initial values ;) Hope this pushes you in the right direction.
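
For reference, here is a sketch of the same recipe adapted to the actual sample frames from the question (my adaptation: column names as defined above, and inclusive boundaries on both ends to match closed = "both"):

# Merge on uniqueID, keep rows whose day falls inside [start, end],
# then aggregate per interval + uniqueID + tag.
tmp = pd.merge(df_dates_range.reset_index(drop=True), df_values, on="uniqueID")
tmp = tmp[(tmp["day"] >= tmp["start"]) & (tmp["day"] <= tmp["end"])]
result = (tmp.groupby(["uniqueID", "start", "end", "df1_tag1"], as_index=False)
             .agg(df2_value1=("df2_value1", "sum"),
                  df2_value2=("df2_value2", "sum"),
                  df2_value3=("df2_value3", "sum")))
result["df2_value2+df2_value3"] = result["df2_value2"] + result["df2_value3"]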

A note: if you're only interested in the first or last value of an interval, pd.merge_asof is an interesting option:

In [17]: pd.merge_asof(df_dates_range, df_values, left_on='start', right_on='day', by='uniqueID', direction='forward')                                                                                      
Out[17]: 
    uniqueID      start        end        day  value1  medium
0          1 2019-12-31 2020-01-04 2020-01-01     1.0  Online
1          2 2019-12-31 2020-01-04 2020-01-02     5.0    Shop
2          3 2019-12-31 2020-01-04 2020-01-04    12.0    Shop
3          4 2019-12-31 2020-01-04 2020-01-04    15.0  Online
4          1 2020-02-01 2020-02-05 2020-02-01    50.0  Online
5          7 2020-02-01 2020-02-05 2020-02-03    10.0    Shop
6         10 2020-02-01 2020-02-05 2020-02-03     2.0  Online
7         11 2020-02-01 2020-02-05 2020-02-05     4.0    Shop
8          3 2020-03-03 2020-03-08 2020-03-03    80.0  Online
9          4 2020-03-03 2020-03-08        NaT     NaN     NaN
10         7 2020-03-03 2020-03-08 2020-03-06    20.0    Shop
11        10 2020-03-03 2020-03-08 2020-03-04    45.0    Shop

However, it's really not possible to squeeze the aggregation into it.

Finally cracked it.

Since the IntervalIndex can only handle unique dates, what I did was map those unique start:end intervals and their unique tag onto df_values. The mistake I had made was using the whole df_dates_range as the IntervalIndex array values, so it was just a matter of extracting the unique ones. One thing I'm not clear on is what happens when/if any of the interval ranges has more than one applicable df1_tag1 value; hopefully it would just create a list of tags and work regardless.

Keep in mind that before doing the following I needed to pivot df_values from long to wide, because the group_by I was doing was at a level that generated duplicate uniqueID+day rows. For some reason I wasn't able to do it with the sample data here, but either way, the following should work whatever format (wide/long) your data is in; you need it so that df_values ends up without duplicate uniqueID+day rows (see the sketch below).
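
A sketch of what that dedupe step could look like on the sample data (my assumption: numeric columns are summed, tags keep the last value seen per uniqueID+day):

# Assumption sketch: collapse df_values so each uniqueID+day appears once
df_values_wide = (df_values
                  .groupby(["uniqueID", "day"], as_index=False)
                  .agg(df2_tag1=("df2_tag1", "last"),
                       df2_tag2=("df2_tag2", "last"),
                       df2_value1=("df2_value1", "sum"),
                       df2_value2=("df2_value2", "sum"),
                       df2_value3=("df2_value3", "sum")))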

After that, I did the following:

# In order to bypass the non "uniqueness" of the desired tag to apply, we create a list of the unique df1_tag1's with their respective start:end dates

df1_tag1_list = df_dates_range.groupby(["start", 
                                        "end", 
                                        "df1_tag1"]).size().reset_index().rename(columns={0:'records'})

Then,

# Create a Series indexed by a pandas IntervalIndex, to then map ("paste onto") the copied df_values using .map()
applicable_df1_tag1 = pd.Series(df1_tag1_list["df1_tag1"].values, 
                                pd.IntervalIndex.from_arrays(df1_tag1_list['start'], 
                                                             df1_tag1_list['end']))

# map the df1_tag1 onto the applicable rows of a copy of df_values
df_values_with_df1_tag1 = df_values.copy()  # or df_values_wide, if you deduped as sketched above
df_values_with_df1_tag1["applicable_df1_tag1"] = df_values_with_df1_tag1["day"].map(applicable_df1_tag1)

The result of this should be the aggregated df_values (or whatever other math you performed during the groupby), with non-duplicate uniqueID+day rows that now carry the mapped df1_tag1, which we can then merge into df_dates_range along with uniqueID.
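
A sketch of that aggregation (my best guess at the groupby described above, summing the values per uniqueID and mapped tag; it also produces the df_values_wide_with_df1_tag1 used in the merge below):

# Assumption sketch: one aggregated row per uniqueID + applicable_df1_tag1,
# ready to be left-merged back onto df_dates_range. Rows whose day matched
# no interval have NaN tags and are dropped by the groupby.
df_values_wide_with_df1_tag1 = (df_values_with_df1_tag1
                                .groupby(["uniqueID", "applicable_df1_tag1"], as_index=False)
                                .agg(df2_value1=("df2_value1", "sum"),
                                     df2_value2=("df2_value2", "sum"),
                                     df2_value3=("df2_value3", "sum")))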

Hope this is a valid answer that works for somebody :)

Edit: possibly also important: when I did the left merge, I used the following to avoid unwanted duplication:

df_dates_range_all = df_dates_range.merge(df_values_wide_with_df1_tag1.drop_duplicates(subset = ["uniqueID"],
                                                                                       keep = "last"),
                                          how = "left",
                                          left_on = ["uniqueID", "df1_tag1"],
                                          right_on = ["uniqueID", "applicable_df1_tag1"],
                                          indicator = True)
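
With indicator = True, the extra _merge column makes it easy to sanity-check the result; intervals that found no activity (like uniqueID 11 during v2) show up as left_only:

# quick sanity check on how many interval rows actually matched
df_dates_range_all["_merge"].value_counts()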