Group IDs by 2 date interval columns and 2 other columns

Question

I have the following dataframe:

ID	Fruit	Price	Location	Start_Date	End_Date
01	Orange	12	ABC	01-03-2015	01-05-2015
01	Orange	9.5	ABC	01-03-2015	01-05-2015
02	Apple	10	PQR	04-09-2019	04-11-2019
06	Orange	11	ABC	01-04-2015	01-06-2015
05	Peach	15	XYZ	07-11-2021	07-13-2021
08	Apple	10.5	PQR	04-09-2019	04-11-2019
10	Apple	10	LMN	04-10-2019	04-12-2019
03	Peach	14.5	XYZ	07-11-2020	07-13-2020
11	Peach	12.5	ABC	01-04-2015	01-05-2015
12	Peach	12.5	ABC	01-03-2015	01-05-2015

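I want to form groups of IDs that share the same Location and Fruit and fall within the same Start_Date and End_Date range. The date-interval condition is that we only group together IDs whose start_date and end_date are no more than 3 days apart. For example, ID 06 has start_date 01-04-2015 and end_date 01-06-2015, while ID 01 has start_date 01-03-2015 and end_date 01-05-2015. The start_date and end_date of IDs 06 and 01 are therefore only 1 day apart, so merging is acceptable (i.e., these two IDs can be grouped together if the other variables, Location and Fruit, also match).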
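As a minimal sketch of the condition above, the pairwise check could look like the following. It assumes one possible reading of "no more than 3 days apart" (both the start dates and the end dates differ by at most 3 days); the helper name `within_gap` and the constant `MAX_GAP` are made up for illustration.

```python
import pandas as pd

# Assumed reading of the rule: starts AND ends each differ by at most 3 days.
MAX_GAP = pd.Timedelta(days=3)
DATE_FORMAT = "%m-%d-%Y"  # the table uses month-day-year strings


def within_gap(a_start, a_end, b_start, b_end, max_gap=MAX_GAP):
    """Return True if two (start, end) intervals are close enough to merge."""
    a_start, a_end, b_start, b_end = (
        pd.to_datetime(d, format=DATE_FORMAT)
        for d in (a_start, a_end, b_start, b_end)
    )
    return abs(a_start - b_start) <= max_gap and abs(a_end - b_end) <= max_gap


# ID 01 vs ID 06: starts and ends are each 1 day apart
ok_01_06 = within_gap("01-03-2015", "01-05-2015", "01-04-2015", "01-06-2015")
# ID 05 vs ID 03: same fruit and location, but a year apart
ok_05_03 = within_gap("07-11-2021", "07-13-2021", "07-11-2020", "07-13-2020")
print(ok_01_06, ok_05_03)
```

Location and Fruit would still have to match before this check is applied.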

Also, I only want to output groups that contain more than 1 unique ID.

My output should be (with the start and end dates merged):

ID	Fruit	Price	Location	Start_Date	End_Date
01	Orange	12	ABC	01-03-2015	01-06-2015
01	Orange	9.5
06	Orange	11
11	Peach	12.5
12	Peach	12.5
02	Apple	10	PQR	04-09-2019	04-11-2019
08	Apple	10.5

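IDs 05 and 03 are filtered out because each ends up as a single record (they do not satisfy the date-interval condition). ID 10 is filtered out because it is from a different location.

I don't know how to merge intervals across 2 date columns like this. I have tried a few techniques to test the grouping (without the date merging).

The most recent one I tried uses pd.Grouper: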

output = df.groupby([pd.Grouper(key='Start_Date', freq='D'),pd.Grouper(key='End_Date', freq='D'),pd.Grouper(key='Location'),pd.Grouper(key='Fruit'),'ID']).agg(unique_emp=('ID', 'nunique'))
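For context, a small sketch of what `freq='D'` does in `pd.Grouper`: it buckets rows by calendar day, so two rows whose dates are one day apart still land in separate groups. The two-row frame `demo` is a made-up subset of the data above.

```python
import pandas as pd

# Two rows taken from the question: ID 01 starts on Jan 3, ID 06 on Jan 4.
demo = pd.DataFrame({
    "ID": [1, 6],
    "Start_Date": pd.to_datetime(["01-03-2015", "01-04-2015"], format="%m-%d-%Y"),
})

# Daily bins: each calendar day becomes its own group.
sizes = demo.groupby(pd.Grouper(key="Start_Date", freq="D")).size()
print(sizes)
```

Each daily bin holds a single row here, which is why daily grouping alone cannot express a "within 3 days" rule.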

I need help getting this output. Thanks!!

Answer 1

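This is a slow, non-vectorized approach: we "manually" iterate over the sorted date values and assign them to bins, moving on to the next bin whenever the gap to the previous date is too large. The resulting bins are then used to add a new group column to the df. Edited so that the ID column is the index.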

from datetime import timedelta
import pandas as pd

#Setup
df = pd.DataFrame(
    columns = ['ID', 'Fruit', 'Price', 'Location', 'Start_Date', 'End_Date'],
    data = [
        [1, 'Orange', 12.0, 'ABC', '01-03-2015', '01-05-2015'],
        [1, 'Orange', 9.5, 'ABC', '01-03-2015', '01-05-2015'],
        [2, 'Apple', 10.0, 'PQR', '04-09-2019', '04-11-2019'],
        [6, 'Orange', 11.0, 'ABC', '01-04-2015', '01-06-2015'],
        [5, 'Peach', 15.0, 'XYZ', '07-11-2021', '07-13-2021'],
        [8, 'Apple', 10.5, 'PQR', '04-09-2019', '04-11-2019'],
        [10, 'Apple', 10.0, 'LMN', '04-10-2019', '04-12-2019'],
        [3, 'Peach', 14.5, 'XYZ', '07-11-2020', '07-13-2020'],
        [11, 'Peach', 12.5, 'ABC', '01-04-2015', '01-05-2015'],
        [12, 'Peach', 12.5, 'ABC', '01-03-2015', '01-05-2015'],
    ]
)

df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])
df = df.set_index('ID')

#Function to bin the dates
def create_date_bin_series(dates, max_span=timedelta(days=3)):
    
    orig_order = zip(dates,range(len(dates)))
    sorted_order = sorted(orig_order)
    
    curr_bin = 1
    curr_date = min(dates)
    
    date_bins = []
    for date,i in sorted_order:
        if date-curr_date > max_span:
            curr_bin += 1

        curr_date = date
        date_bins.append((curr_bin,i))
    
    #sort the date_bins to match the original order
    date_bins = [v for v,_ in sorted(date_bins, key = lambda x: x[1])]
    return date_bins
        
#Apply function to group each date into a bin with other dates within 3 days of it
start_bins = create_date_bin_series(df['Start_Date'])
end_bins = create_date_bin_series(df['End_Date'])

#Group by new columns
df['fruit_group'] = df.groupby(['Fruit','Location',start_bins,end_bins]).ngroup()

#Print the table sorted by these new groups
print(df.sort_values('fruit_group'))

#you can use the new fruit_group column to filter and agg etc
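One way to use the new column, sketched under assumptions: the frame `grouped` below is a hypothetical stand-in with `ID` as a regular column (with the code above you would call `df.reset_index()` first), and the `fruit_group` numbers are illustrative rather than the actual `ngroup()` output.

```python
import pandas as pd

# Hypothetical stand-in for the grouped frame; fruit_group values are made up.
grouped = pd.DataFrame({
    "ID": [1, 1, 6, 2, 8, 10],
    "Fruit": ["Orange", "Orange", "Orange", "Apple", "Apple", "Apple"],
    "Location": ["ABC", "ABC", "ABC", "PQR", "PQR", "LMN"],
    "Start_Date": pd.to_datetime(
        ["2015-01-03", "2015-01-03", "2015-01-04", "2019-04-09", "2019-04-09", "2019-04-10"]
    ),
    "End_Date": pd.to_datetime(
        ["2015-01-05", "2015-01-05", "2015-01-06", "2019-04-11", "2019-04-11", "2019-04-12"]
    ),
    "fruit_group": [2, 2, 2, 1, 1, 0],
})

# One row per group: distinct-ID count plus the merged date range.
summary = grouped.groupby("fruit_group").agg(
    n_ids=("ID", "nunique"),
    Fruit=("Fruit", "first"),
    Location=("Location", "first"),
    Start_Date=("Start_Date", "min"),
    End_Date=("End_Date", "max"),
)

# Keep only groups with more than one unique ID.
summary = summary[summary["n_ids"] > 1]
print(summary)
```

Taking the minimum start and maximum end per group gives the merged interval the question asks for, e.g. 01-03-2015 to 01-06-2015 for Orange at ABC.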

Answer 2

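This is essentially a gaps-and-islands problem. If you sort the dataframe by Fruit, Location, and Start_Date, you can create islands (i.e., fruit groups) as follows:

If the current row's Fruit or Location differs from the previous row's, start a new island
If the current row's End_Date is more than 3 days later than the island's Start_Date, start a new island

Code: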

for col in ["Start_Date", "End_Date"]:
    df[col] = pd.to_datetime(df[col])

# This algorithm requires a sorted dataframe
df = df.sort_values(["Fruit", "Location", "Start_Date"])

# Assign each row to an island
i = 0
islands = []
last_fruit, last_location, last_start = None, None, df["Start_Date"].iloc[0]

for _, (fruit, location, start, end) in df[["Fruit", "Location", "Start_Date", "End_Date"]].iterrows():
    if (fruit != last_fruit) or (location != last_location) or (end - last_start > pd.Timedelta(days=3)):
        i += 1
        last_fruit, last_location, last_start = fruit, location, start
    else:
        last_fruit, last_location = fruit, location
    islands.append(i)

df["Island"] = islands

# Filter for islands having more than 1 row (this counts rows, not unique IDs)
idx = pd.Series(islands).value_counts().loc[lambda c: c > 1].index
df[df["Island"].isin(idx)]
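Because the question asks for groups with more than 1 unique ID, a variant that counts distinct IDs per island and attaches the merged date range to each row might look like the sketch below. The frame `rows` is a hypothetical stand-in for the dataframe after `Island` has been assigned; island 4 contains a made-up duplicate row for ID 5 to show the difference from a row count. The column names `Merged_Start` and `Merged_End` are also illustrative.

```python
import pandas as pd

# Hypothetical stand-in: island 4 has two rows but only one distinct ID.
rows = pd.DataFrame({
    "ID": [2, 8, 10, 1, 1, 6, 5, 5],
    "Island": [1, 1, 2, 3, 3, 3, 4, 4],
    "Start_Date": pd.to_datetime([
        "2019-04-09", "2019-04-09", "2019-04-10", "2015-01-03",
        "2015-01-03", "2015-01-04", "2021-07-11", "2021-07-11",
    ]),
    "End_Date": pd.to_datetime([
        "2019-04-11", "2019-04-11", "2019-04-12", "2015-01-05",
        "2015-01-05", "2015-01-06", "2021-07-13", "2021-07-13",
    ]),
})

by_island = rows.groupby("Island")

# Boolean mask: True for rows whose island holds more than one distinct ID.
keep = by_island["ID"].transform("nunique") > 1

result = rows[keep].copy()
# Broadcast each island's overall date range back onto its rows.
result["Merged_Start"] = by_island["Start_Date"].transform("min")[keep]
result["Merged_End"] = by_island["End_Date"].transform("max")[keep]
print(result)
```

With a row-count filter, island 4 would be kept; counting distinct IDs drops it.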

dataframe

pandas

pandas-groupby