IntervalIndex merging in Pandas through duplicate dates + uniqueID
First question on SO, still learning Python & pandas.
Edit: I've managed to pivot the values DF from long to wide so that I have a unique id+date index (i.e., no uniqueID has more than 1 row per day). However, I still haven't been able to reach the result I'm after.
I have a couple of DFs that I want to merge, based on A) the uniqueID and B) whether that uniqueID falls within several distinct date ranges. I found some existing questions which approach what I'm looking for; however, those solutions aren't viable here, and after digging a bit it would appear that what I'm attempting isn't possible because the dates overlap (?).
The gist of it: if a uniqueID is in df_dates_range and its corresponding day column falls within one of the start:end ranges from dates_range, sum all of its values from df_values.
There are more columns in each DF, but these are the relevant ones. There are duplicates all over the place, in no particular order. All the DF series are properly formatted.
So, here is df1, dates_range:
import pandas as pd
import numpy as np
dates_range = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
"start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
"end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
"df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"]}
df_dates_range = pd.DataFrame(dates_range,
                              columns = ["uniqueID",
                                         "start",
                                         "end",
                                         "df1_tag1"])
df_dates_range[["start","end"]] = df_dates_range[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)
And df2, the values:
values = {"uniqueID": [1, 2, 7, 3, 4, 4, 10, 1, 8, 7, 10, 9, 10, 8, 3, 10, 11, 3, 7, 4, 10, 14],
"df2_tag1": ["abc", "abc", "abc", "abc", "abc", "def", "abc", "abc", "abc", "abc", "abc", "abc", "def", "def", "abc", "abc", "abc", "def", "abc", "abc", "def", "abc"],
"df2_tag2": ["type 1", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 2", "type 1", "type 1", "type 1"],
"day": ["01/01/2020", "01/02/2020", "01/03/2020", "01/03/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/01/2020", "02/02/2020", "02/03/2020", "02/03/2020", "02/04/2020", "02/05/2020", "02/05/2020", "03/03/2020", "03/04/2020", "03/04/2020", "03/06/2020", "03/06/2020", "03/07/2020", "03/06/2020", "04/08/2020"],
"df2_value1": [2, 10, 6, 5, 7, 9, 3, 10, 9, 7, 4, 9, 1, 8, 7, 5, 4, 4, 2, 8, 8, 4],
"df2_value2": [1, 5, 10, 13, 15, 10, 12, 50, 3, 10, 2, 1, 4, 6, 80, 45, 3, 30, 20, 7.5, 15, 3],
"df2_value3": [0.547, 2.160, 0.004, 9.202, 7.518, 1.076, 1.139, 25.375, 0.537, 7.996, 1.475, 0.319, 1.118, 2.927, 7.820, 19.755, 2.529, 2.680, 17.762, 0.814, 1.201, 2.712]}
values["day"] = pd.to_datetime(values["day"], format = "%m/%d/%Y")
df_values = pd.DataFrame(values,
                         columns = ["uniqueID",
                                    "df2_tag1",
                                    "df2_tag2",
                                    "day",
                                    "df2_value1",
                                    "df2_value2",
                                    "df2_value3"])
Starting from the first link, I tried running the following:
df_dates_range.index = pd.IntervalIndex.from_arrays(df_dates_range["start"],
                                                    df_dates_range["end"],
                                                    closed = "both")
df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])
However, I got this error. After a n00b check (removing the index set on the second-to-last line), the problem persists:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-58-54ea384e06f7> in <module>
14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
15
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])
C:\anaconda\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-58-54ea384e06f7> in <lambda>(x)
14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
15
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_indexer_non_unique(self, target)
4471 @Appender(_index_shared_docs["get_indexer_non_unique"] % _index_doc_kwargs)
4472 def get_indexer_non_unique(self, target):
-> 4473 target = ensure_index(target)
4474 pself, ptarget = self._maybe_promote(target)
4475 if pself is not self or ptarget is not target:
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in ensure_index(index_like, copy)
5355 index_like = copy(index_like)
5356
-> 5357 return Index(index_like)
5358
5359
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
420 return Index(np.asarray(data), dtype=dtype, copy=copy, name=name, **kwargs)
421 elif data is None or is_scalar(data):
--> 422 raise cls._scalar_data_error(data)
423 else:
424 if tupleize_cols and is_list_like(data):
TypeError: Index(...) must be called with a collection of some kind, Timestamp('2020-01-01 00:00:00') was passed
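For what it's worth, the TypeError itself comes from passing a scalar Timestamp where get_indexer_non_unique() expects a collection (ensure_index() rejects scalars, as the last traceback frame shows). Below is a minimal sketch of a repaired lookup, assuming the intent was to find, for each day, the matching rows of the interval-indexed df_dates_range; note that get_indexer_non_unique() returns an (indexer, missing) tuple, not a plain array:
# Sketch only: wrap the scalar day in a list so pandas receives a collection,
# and unpack the (indexer, missing) tuple before positional indexing.
def ranges_for_day(day):
    indexer, _missing = df_dates_range.index.get_indexer_non_unique([day])
    return df_dates_range.iloc[indexer[indexer >= 0]]  # drop -1 (no match)

matches = df_values["day"].apply(ranges_for_day)  # Series of matching-range frames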
The desired outcome would be:
desired = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
"start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
"end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
"df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"],
"df2_tag1": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"],
"df2_value1": [2, 10, 5, 16, 10, 7, 5, np.nan, 11, 8, 2, 8],
"df2_value2+df2_value3": [1.547, 7.160, 22.202, 33.595, 75.375, 17.996, 8.594, np.nan, 120.501, 8.314, 37.762, 16.201],
"df2_tag3": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"]}
df_desired = pd.DataFrame(desired,
                          columns = ["uniqueID",
                                     "start",
                                     "end",
                                     "df1_tag1",
                                     "df2_tag1",
                                     "df2_value1",
                                     "df2_value2+df2_value3",
                                     "df2_tag3"])
df_desired[["start","end"]] = df_desired[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)
Or as a graphical visualization (screenshot not reproduced here):
Note that columns S and T on row 10 are NaN, because uniqueID 11 had no "activity" during v2; however, if possible, I'd like to somehow pull its tags from df2. They are 100% there, just maybe not for that period; perhaps a task for a second script? Also, note that col T is an aggregate of cols J+K.
Edit: forgot to mention that I had previously tried @firelynx's solution to this, but despite having 32GB of RAM my machine couldn't cope. The SQL solution didn't work for me because of some sqlite3 library issues.
In these cases, the simplest thing (if you can afford it hardware-wise) is to create a temporary DataFrame and do the aggregation afterwards. This has the huge advantage of separating the merge from the aggregation, and greatly reduces complexity.
In [22]: df = pd.merge(df_dates_range, df_values)
Out[22]:
uniqueID start end day value1 medium
0 1 2019-12-31 2020-01-04 2020-01-01 1 Online
1 1 2019-12-31 2020-01-04 2020-02-01 50 Online
2 1 2020-02-01 2020-02-05 2020-01-01 1 Online
3 1 2020-02-01 2020-02-05 2020-02-01 50 Online
4 2 2019-12-31 2020-01-04 2020-01-02 5 Shop
.. ... ... ... ... ... ...
23 10 2020-02-01 2020-02-05 2020-03-04 45 Shop
24 10 2020-03-03 2020-03-08 2020-01-03 13 Shop
25 10 2020-03-03 2020-03-08 2020-02-03 2 Online
26 10 2020-03-03 2020-03-08 2020-03-04 45 Shop
27 11 2020-02-01 2020-02-05 2020-02-05 4 Shop
In [24]: df = df[(df['day'] > df['start']) & (df['day'] <= df['end'])]
Out[24]:
uniqueID start end day value1 medium
0 1 2019-12-31 2020-01-04 2020-01-01 1 Online
4 2 2019-12-31 2020-01-04 2020-01-02 5 Shop
5 3 2019-12-31 2020-01-04 2020-01-04 12 Shop
10 3 2020-03-03 2020-03-08 2020-03-06 30 Online
11 4 2019-12-31 2020-01-04 2020-01-04 15 Online
12 4 2019-12-31 2020-01-04 2020-01-04 10 Shop
16 7 2020-02-01 2020-02-05 2020-02-03 10 Shop
20 7 2020-03-03 2020-03-08 2020-03-06 20 Shop
22 10 2020-02-01 2020-02-05 2020-02-03 2 Online
26 10 2020-03-03 2020-03-08 2020-03-04 45 Shop
27 11 2020-02-01 2020-02-05 2020-02-05 4 Shop
Then you can do something like
In [30]: df.groupby(['start', 'end', 'uniqueID', 'medium'])['value1'].agg(['count', 'sum']).reset_index()
Out[30]:
start end uniqueID medium count sum
0 2019-12-31 2020-01-04 1 Online 1 1
1 2019-12-31 2020-01-04 2 Shop 1 5
2 2019-12-31 2020-01-04 3 Shop 1 12
3 2019-12-31 2020-01-04 4 Online 1 15
4 2019-12-31 2020-01-04 4 Shop 1 10
5 2020-02-01 2020-02-05 7 Shop 1 10
6 2020-02-01 2020-02-05 10 Online 1 2
7 2020-02-01 2020-02-05 11 Shop 1 4
8 2020-03-03 2020-03-08 3 Online 1 30
9 2020-03-03 2020-03-08 7 Shop 1 20
10 2020-03-03 2020-03-08 10 Shop 1 45
to aggregate the data into the desired shape. However, I don't get quite the results you expect: there are rows with Shop in the values, and some of the dates are a bit off. I blame the initial values ;) Hope this pushes you in the right direction.
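For reference, the same recipe against the question's actual frames could look like the sketch below. It assumes df2_value2 and df2_value3 are meant to be summed into a single column, and uses >= on start to match the closed="both" intervals defined earlier:
# Sketch only: merge, filter each day into its interval, then aggregate,
# using the question's actual column names.
tmp = pd.merge(df_dates_range, df_values, on = "uniqueID")
tmp = tmp[(tmp["day"] >= tmp["start"]) & (tmp["day"] <= tmp["end"])]
tmp["df2_value2+df2_value3"] = tmp["df2_value2"] + tmp["df2_value3"]
out = (tmp.groupby(["uniqueID", "start", "end", "df1_tag1"], as_index = False)
          .agg({"df2_value1": "sum", "df2_value2+df2_value3": "sum"}))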
A caveat: if you are only interested in the first or last value within an interval, pd.merge_asof is an interesting option:
In [17]: pd.merge_asof(df_dates_range, df_values, left_on='start', right_on='day', by='uniqueID', direction='forward')
Out[17]:
uniqueID start end day value1 medium
0 1 2019-12-31 2020-01-04 2020-01-01 1.0 Online
1 2 2019-12-31 2020-01-04 2020-01-02 5.0 Shop
2 3 2019-12-31 2020-01-04 2020-01-04 12.0 Shop
3 4 2019-12-31 2020-01-04 2020-01-04 15.0 Online
4 1 2020-02-01 2020-02-05 2020-02-01 50.0 Online
5 7 2020-02-01 2020-02-05 2020-02-03 10.0 Shop
6 10 2020-02-01 2020-02-05 2020-02-03 2.0 Online
7 11 2020-02-01 2020-02-05 2020-02-05 4.0 Shop
8 3 2020-03-03 2020-03-08 2020-03-03 80.0 Online
9 4 2020-03-03 2020-03-08 NaT NaN NaN
10 7 2020-03-03 2020-03-08 2020-03-06 20.0 Shop
11 10 2020-03-03 2020-03-08 2020-03-04 45.0 Shop
However, it's not really possible to squeeze the aggregation into it.
Finally cracked it.
Since an IntervalIndex can only handle unique dates, what I did was map those unique start:end intervals, and their unique tags, onto df_values. The mistake I had made was using the whole df_dates_range as the IntervalIndex array values, so it was just a matter of extracting the unique ones. One thing I'm not clear on is what happens when/if any interval range has more than one applicable df1_tag1 value; hopefully it will just create a list of tags and work anyway.
Keep in mind that before doing the following I needed to pivot df_values from long to wide format, because I had used a group_by at a level that produced duplicate uniqueID+day rows. For some reason I couldn't pull that off with the sample data here, but in any case the following should work provided your data is in the right (wide/long) format; you need it so that df_values has no duplicate uniqueID+day rows (a sketch of one way to do this follows).
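The long-to-wide collapse isn't shown in the original; one possible sketch, assuming summing the value columns and keeping the first tag per uniqueID+day is acceptable:
# Sketch: collapse df_values to one row per uniqueID+day so the interval
# mapping below never sees duplicate keys.
df_values_wide = (df_values.groupby(["uniqueID", "day"], as_index = False)
                           .agg({"df2_tag1": "first",
                                 "df2_tag2": "first",
                                 "df2_value1": "sum",
                                 "df2_value2": "sum",
                                 "df2_value3": "sum"}))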
After that, I did the following:
# In order to bypass the non "uniqueness" of the desired tag to apply, we create a list of the unique df1_tag1's with their respective start:end dates
df1_tag1_list = df_dates_range.groupby(["start",
                                        "end",
                                        "df1_tag1"]).size().reset_index().rename(columns = {0: "records"})
Then,
# Create a new pandas IntervalIndex series variable to then map ("paste onto") the copied df_values using .map()
applicable_df1_tag1 = pd.Series(df1_tag1_list["df1_tag1"].values,
                                pd.IntervalIndex.from_arrays(df1_tag1_list["start"],
                                                             df1_tag1_list["end"],
                                                             closed = "both"))  # closed="both" so days equal to start/end match, as with the earlier IntervalIndex
# map the df1_tag1 to the applicable rows in the copied df_values
df_values_with_df1_tag1 = df_values_wide.copy()  # the "copied df_values" (wide format, see the sketch above)
df_values_with_df1_tag1["applicable_df1_tag1"] = df_values_with_df1_tag1["day"].map(applicable_df1_tag1)
The result of this should be the aggregated df_values (sums, or whatever other math you perform during the groupby), with non-duplicate uniqueID+day rows that now carry the mapped df1_tag1, which we can then merge into df_dates_range on uniqueID as well.
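As a sketch of that aggregation step (the exact aggregation functions are an assumption; sums are used here to match the desired output):
# Sketch: aggregate the tagged rows to one row per uniqueID + tag before
# merging back onto df_dates_range.
df_values_agg = (df_values_with_df1_tag1
                 .groupby(["uniqueID", "applicable_df1_tag1"], as_index = False)
                 .agg({"df2_value1": "sum", "df2_value2": "sum", "df2_value3": "sum"}))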
Hope this is a valid answer for someone :)
Edit: possibly also important, when I did the left merge I used the following to avoid unwanted duplicates:
df_date_ranges_all = df_dates_range.merge(df_values_with_df1_tag1.drop_duplicates(subset = ["uniqueID"],
                                                                                  keep = "last"),
                                          how = "left",
                                          left_on = ["uniqueID", "df1_tag1"],
                                          right_on = ["uniqueID", "applicable_df1_tag1"],
                                          indicator = True)