Join two dataframes based on closest combination that sums up to a target value

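I'm trying to join the two dataframes below based on the closest combination of rows from the df2 column Sales that sums up to the target value in the df1 column Total Sales. The columns Name and Date must be the same in both dataframes when joining (as shown in the expected output).

For example: in df1, row number 0 should be matched only with df2 rows 0 and 1, since the columns Name and Date are the same, i.e. Name: John and Date: 2021-10-01.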

df1 :

import pandas as pd

df1 = pd.DataFrame({"Name":{"0":"John","1":"John","2":"Jack","3":"Nancy","4":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-11-01","2":"2021-10-10","3":"2021-10-12","4":"2021-10-30"},
                    "Total Sales":{"0":15500,"1":5500,"2":17600,"3":20700,"4":12000}})

    Name    Date        Total Sales
0   John    2021-10-01  15500
1   John    2021-11-01  5500
2   Jack    2021-10-10  17600
3   Nancy   2021-10-12  20700
4   Ahmed   2021-10-30  12000

df2 :

df2 = pd.DataFrame({"ID":{"0":"JO1","1":"JO2","2":"JO3","3":"JO4","4":"JA1","5":"JA2","6":"NA1",
                          "7":"NA2","8":"NA3","9":"NA4","10":"AH1","11":"AH2","12":"AH3","13":"AH3"},
                    "Name":{"0":"John","1":"John","2":"John","3":"John","4":"Jack","5":"Jack","6":"Nancy","7":"Nancy",
                            "8":"Nancy","9":"Nancy","10":"Ahmed","11":"Ahmed","12":"Ahmed","13":"Ahmed"},
                    "Date":{"0":"2021-10-01","1":"2021-10-01","2":"2021-11-01","3":"2021-11-01","4":"2021-10-10","5":"2021-10-10","6":"2021-10-12","7":"2021-10-12",
                            "8":"2021-10-12","9":"2021-10-12","10":"2021-10-30","11":"2021-10-30","12":"2021-10-30","13":"2021-10-29"},
                    "Sales":{"0":10000,"1":5000,"2":1000,"3":5500,"4":10000,"5":7000,"6":20000,
                             "7":100,"8":500,"9":100,"10":5000,"11":7000,"12":10000,"13":12000}})

    ID  Name    Date        Sales
0   JO1 John    2021-10-01  10000
1   JO2 John    2021-10-01  5000
2   JO3 John    2021-11-01  1000
3   JO4 John    2021-11-01  5500
4   JA1 Jack    2021-10-10  10000
5   JA2 Jack    2021-10-10  7000
6   NA1 Nancy   2021-10-12  20000
7   NA2 Nancy   2021-10-12  100
8   NA3 Nancy   2021-10-12  500
9   NA4 Nancy   2021-10-12  100
10  AH1 Ahmed   2021-10-30  5000
11  AH2 Ahmed   2021-10-30  7000
12  AH3 Ahmed   2021-10-30  10000
13  AH3 Ahmed   2021-10-29  12000

Expected output:

    Name    Date        Total Sales Comb IDs            Comb Total
0   John    2021-10-01  15500       JO1, JO2            15000.0
1   John    2021-11-01  5500        JO4                 5500.0
2   Jack    2021-10-10  17600       JA1, JA2            17000.0
3   Nancy   2021-10-12  20700       NA1, NA2, NA3, NA4  20700.0
4   Ahmed   2021-10-30  12000       AH1, AH2            12000.0

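What I have tried below works for only one row at a time, and I'm not sure how to apply it to pandas dataframes to get the expected output.

In the script below, the variable numbers represents the Sales column in df2, and the variable target represents the Total Sales column in df1.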

import itertools
import math

numbers = [1000, 5000, 3000]
target = 6000

best_combination = ((None,))
best_result = math.inf
best_sum = 0

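# Try every combination size, from the empty combination up to all numbers.
for size in range(len(numbers) + 1):
    for combination in itertools.combinations(numbers, size):
        # Use a separate name so the built-in sum() is not shadowed.
        total = sum(combination)
        result = target - total
        # Keep the combination whose sum is closest to the target.
        if abs(result) < abs(best_result):
            best_result = result
            best_combination = combination
            best_sum = total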

print("\nbest sum{} = {}".format(best_combination, best_sum))


[Out] best sum(1000, 5000) = 6000

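Turn the code you wrote to find the best sum into a function (let's call it opt) that takes the target and a dataframe (which will be a subset of df2) as parameters. It needs to return a list of the IDs corresponding to the optimal combination.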

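Write another function that takes 3 arguments: name, date and target (let's call it calc). This function filters df2 by name and date, passes the result to opt along with the target, and returns whatever opt returns. Finally, iterate over the rows of df1 and call calc with each row's values (or use pandas.DataFrame.apply).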
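
The two steps above could be sketched roughly as follows. This is only one possible reading of the suggestion, not a definitive implementation. The names opt and calc come from the description, while the helper combine is a hypothetical name introduced here. As an assumption beyond the description, opt returns the matched sum together with the list of IDs, so that the Comb Total column can be filled as well. The dataframes are rebuilt with a default integer index to keep the example self-contained.

```python
import itertools
import math

import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "John", "Jack", "Nancy", "Ahmed"],
    "Date": ["2021-10-01", "2021-11-01", "2021-10-10", "2021-10-12", "2021-10-30"],
    "Total Sales": [15500, 5500, 17600, 20700, 12000],
})
df2 = pd.DataFrame({
    "ID": ["JO1", "JO2", "JO3", "JO4", "JA1", "JA2", "NA1",
           "NA2", "NA3", "NA4", "AH1", "AH2", "AH3", "AH3"],
    "Name": ["John"] * 4 + ["Jack"] * 2 + ["Nancy"] * 4 + ["Ahmed"] * 4,
    "Date": ["2021-10-01"] * 2 + ["2021-11-01"] * 2 + ["2021-10-10"] * 2
            + ["2021-10-12"] * 4 + ["2021-10-30"] * 3 + ["2021-10-29"],
    "Sales": [10000, 5000, 1000, 5500, 10000, 7000, 20000,
              100, 500, 100, 5000, 7000, 10000, 12000],
})


def opt(target, subset):
    """Brute-force search for the rows of `subset` whose Sales sum is closest to `target`.

    Returns (ids, total). Returning the total too is an assumption made here
    so that the caller can fill the "Comb Total" column.
    """
    best_ids, best_sum, best_diff = [], 0, math.inf
    rows = list(zip(subset["ID"], subset["Sales"]))
    for size in range(len(rows) + 1):
        for combination in itertools.combinations(rows, size):
            total = sum(sales for _, sales in combination)
            diff = abs(target - total)
            # Strict "<" keeps the first combination found on ties.
            if diff < best_diff:
                best_ids = [row_id for row_id, _ in combination]
                best_sum, best_diff = total, diff
    return best_ids, best_sum


def calc(name, date, target):
    """Filter df2 by name and date, then delegate to opt."""
    subset = df2[(df2["Name"] == name) & (df2["Date"] == date)]
    return opt(target, subset)


def combine(row):
    """Hypothetical helper: build the two new columns for one df1 row."""
    ids, total = calc(row["Name"], row["Date"], row["Total Sales"])
    return pd.Series({"Comb IDs": ", ".join(ids), "Comb Total": float(total)})


# Apply row by row and attach the new columns to df1.
result = df1.join(df1.apply(combine, axis=1))
print(result)
```

On the sample data above, this should reproduce the Comb IDs and Comb Total columns of the expected output. Note that the search tries every subset of each Name/Date group, so its cost doubles with each extra row in a group. That is fine for groups of a handful of rows, as here, but it gets slow quickly for larger groups.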