Pandas 在其他列约束下找到最小化列总和的行子集

Pandas find subset of rows minimizing the sum of a column under other column constraint

我有一个非常简单的想法,即找到使一列总和最小的行子集,而另一列的总和必须大于某个值。

示例:

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
               'Target': [35, 15, 12, 8, 7, 5],
               'Cost': [15, 40, 30, 30, 25, 10]})




    Names   Target  Cost
0   a       35      15
1   b       15      40
2   c       12      30
3   d       8       30
4   e       7       25
5   f       5       10

在上面的这个例子中,我想找到最小化 Cost 列的行子集,而 Target 的总和必须大于 40。

在此示例中,我要构建的函数将 return ['a', 'f'] 因为满足约束 35 + 5 >= 40 且成本在满足约束的同时,15 + 10 = 25 不能与任何其他行组合一起降低。

我正在寻找什么库或想法来解决这个问题?

从模块导入、虚拟数据和所需阈值 40 开始。

import pandas as pd
import itertools

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
               'Target': [35, 15, 12, 8, 7, 5],
               'Cost': [15, 40, 30, 30, 25, 10]})

max_value = 40 # greater than or equal to

迭代所有目标值,包括行索引。

numbers = [(index, target) for index, names, target, cost in df.itertuples()]
#print(numbers)

Select 所有超过阈值的组合和所有值的总和减去最小值小于阈值(避免包含额外的值)。

# reference: 
combinations = [(x[0] for x in seq) for i in range(len(numbers), 0, -1) \
                for seq in itertools.combinations(numbers, i) \
                if sum([x[1] for x in seq]) >= max_value and sum([x[1] for x in seq]) - min([x[1] for x in seq]) < max_value]
combinations = list(set(tuple(x) for x in combinations))
#print(combinations)

计算最小成本和 return 相关值的组合。

min_cost, min_index, min_name = None, None, None
for combination in combinations:
    sum_cost = sum([df["Cost"].loc[x] for x in combination])
    if min_cost == None or sum_cost < min_cost:
        min_cost, min_index, min_name = sum_cost, combination, [df["Names"].loc[x] for x in combination]
print("minimum cost:", min_cost)
#print("Index combination:", min_index) # if Names not unique
print("Names combination:", min_name)

输出

minimum cost: 25
Names combination: ['a', 'f']

让我们使用 itertools recipe powerset:

import pandas as pd
import numpy as np
from itertools import combinations, chain

df = pd.DataFrame(
    {
        "Names": ["a", "b", "c", "d", "e", "f"],
        "Target": [35, 15, 12, 8, 7, 5],
        "Cost": [15, 40, 30, 30, 25, 10],
    }
)


def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))


df_out = min(
    (
        df.loc[list(i)]
        for i in powerset(df.index)
        if df.loc[list(i), "Target"].sum() >= 40
    ),
    key=lambda x: x["Cost"].sum(),
)

df_out

输出:

  Names  Target  Cost
0     a      35    15
5     f       5    10

详情:

使用 itertools recipe "powerset" 查找从零到数据帧长度的所有组合或索引。然后,使用列表理解创建一个生成器,在 returns 上目标总和为 40 或更大的那些组合。最后,使用带有 key 参数的 python min 函数来 return 成本最低的 dataframe/combination。

这是解决问题的一种蛮力方法,但它会起作用:

from itertools import combinations

df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
               'Target': [35, 15, 12, 8, 7, 5],
               'Cost': [15, 40, 30, 30, 25, 10]})

lst=[]
for x in range(2, (len(df.Target)+1)):
    lst+=[y for y in list(combinations(list(zip(list(df.Target), list(df.Cost))),x))]
    
lst1=[]
count=-1
for x in lst:
    count+=1
    v0=0
    v1=0
    for y in x:
        v0+=y[0]
        v1+=y[1]
    lst1.append((count,v0, v1))

df01=pd.DataFrame(lst1, columns=["ID", "target_sum", "cost_sum"])
ans=lst[df01[df01.target_sum>=40].cost_sum.idxmin()]

Answer=[df[(df.Target==ans[x][0]) & (df.Cost==ans[x][1])].Names.values[0] for x in range(0, len(ans))]

输出:

['a', 'f']

我们可以将其设置为一个包含四个部分的约束优化问题:

  1. 创建变量:我们将用布尔向量表示我们对行选择的选择。在第 k 个条目中为真意味着我们已经选择了第 k 行,如果为假则意味着我们没有选择。
  2. 指定约束:我们需要确保目标行的总和大于阈值。通过计算目标列和所选行的向量之间的点积来完成。
  3. 指定一个objective函数。这里的objective函数是所选行的成本之和,即成本列和所选行向量的点积。
  4. 解决这个问题,在这种情况下是最小化 objective 受约束的函数。

有几个 Python 运筹学库,即 OR Libraries for solving this type of problem. This solution uses Google OR-Tools 这是一个“用于优化的开源软件套件。

我们表明,使用优化包求解比对所有可能的行选择执行详尽搜索的替代解决方案要快得多。穷举搜索具有指数计算复杂度 O(2^nrows),因此仅适用于少量行(即 < 30)。

代码

import numpy as np 
import pandas as pd

# Google or-tools solver
from ortools.sat.python import cp_model

import timeit

def solve(df, threshold):
    '''
    Uses or-tools module to solve optimization

    '''
    weights = df['Target']
    cost = df['Cost']
    names = df['Names']

    # Creates the model.
    model = cp_model.CpModel()

    # Step 1: Create the variables
    # array containing row selection flags i.e. True if row k is selected, False otherwise
    # Note: treated as 1/0 in arithmeetic expressions
    row_selection = [model.NewBoolVar(f'{i}') for i in range(df.shape[0])]

    # Step 2: Define the constraints
    # The sum of the weights for the selected rows should be >= threshold
    model.Add(weights.dot(row_selection) >= threshold)

    # Step 3: Define the objective function
    # Minimize the total cost (based upon rows selected)
    model.Minimize(cost.dot(row_selection))

    # Step 4: Creates the solver and solve.
    solver = cp_model.CpSolver()
    solver.Solve(model)

    # Get the rows selected
    rows = [row for row in range(df.shape[0]) if solver.Value(row_selection[row])]

    return names[rows]


# Setup
df = pd.DataFrame({'Names': ['a', 'b', 'c', 'd', 'e', 'f'],
               'Target': [35, 15, 12, 8, 7, 5],
               'Cost': [15, 40, 30, 30, 25, 10]})

print(solve(df, 40))

# Output:
0    a
5    f
Name: Names, dtype: object

性能

当前解决方案(基于 OR-Tools)

%timeit main(df, 40)
3.13 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

.

等穷举搜索算法相比
from itertools import combinations, chain
    
df = pd.DataFrame(
        {
            "Names": ["a", "b", "c", "d", "e", "f"],
            "Target": [35, 15, 12, 8, 7, 5],
            "Cost": [15, 40, 30, 30, 25, 10],
        }
    )
    
    
def powerset(iterable):
        "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
        s = list(iterable)
        return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))
    
    
%timeit min( (df.loc[list(i)] for i in powerset(df.index) if df.loc[list(i), "Target"].sum() >= 40), key=lambda x: x["Cost"].sum(),)

64.4 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

因此,使用 OR-Tools 比穷举搜索快 20 倍(即 3.13 毫秒对 64.4 毫秒)