选择具有特定阈值的重叠簇

Question

说，我已经对我的数据集进行了聚类并且有 10 个聚类。这些集群是不重叠的。但现在假设我更改了所有数据点中的某些特征并再次进行聚类。现在我还有 10 个集群。如果我再重复 3 次，最后我会有 50 个簇。每个集群都有一个与之关联的分数，该分数是根据其组成数据点计算得出的。

这 50 个集群现在有重叠的数据点。我想 select 从这 50 个允许一定重叠阈值的集群中 select 所有可能的集群，以获得 selected 集群的最高总分。

一种方法是贪心法，我根据分数从高到低对聚类进行排序。然后 select 得分最高的集群。然后从那里保留 selecting 集群，这些集群在阈值内与已经 selected 的集群重叠。但是虽然速度很快，但似乎不是最优解。

示例：假设我有 3 个具有以下分数的集群：

C1 = (A,B,C,D,E,F) 分数 = 10

C2 = (A,B,C,D) 分数 = 6

C3 = (D,E,F) 得分 = 6

允许的重叠为 1 个元素或小于较小簇大小的 40%。

贪婪的方法将return{C1}总分10分，而更好的选择是{C2,C3}总分6+6=12，重叠元素'D'，即 1/size(C3) = 1/3 = 33.33% < 40%

我正在寻找另一种可以提供最优解或比上述贪婪方法更好的解的方法。

Answer 1

下面link给出了您问题的无约束版本的答案：

您可以在上述 link 编码的模型中添加新的约束，用于检查所选集群之间的重叠并通过允许的阈值对其进行限制。

这是上述问题的 python 代码：

from gurobipy import *
import string
# List of all subtomograms
data_points = string.ascii_uppercase[:6]
data_points = list(data_points)

# Clusters as list of lists, where each list is list of subtomograms
clusters = []
clusters.append(['A', 'B', 'C', 'D', 'E', 'F'])
clusters.append(['A', 'B', 'C', 'D'])
clusters.append(['D', 'E', 'F'])

# Create a matrix: num_subtomograms x num_clusters
matrix = {}
for dp in data_points:
    matrix[dp] = [0]*len(clusters)

# Make matrix[subtomogram_i][cluster_i] = 1, if subtomogram_i is present in cluster_i
for i in range(0, len(clusters)):
    for dp in clusters[i]:
        matrix[dp][i] = 1

# Score of each cluster in the same order as used in matrix
cost = [10, 6, 6]

# Gurobi MIP model
m = Model("Cluster selection optimization")
m.params.outputflag = 1
m.params.method = 2 # for barrier method in Gurobi, it is used to solve quadratic programming problems

# Adding a variable x where x[i] will represent whether or not ith cluster is selected or not
x = m.addVars(len(clusters), vtype=GRB.BINARY, name='x')

# generate objective function: score[0]x[0] + score[1]x[1] .....
indices = range(0, len(clusters))
coef_x = dict()
obj = 0.0
for i in indices:
    coef_x[i] = cost[i]
    obj += coef_x[i] * x[i]
m.setObjective(obj, GRB.MAXIMIZE)

# Generate constraints
threshhold = 0.4 # 40% threshold set
count = 0
m_sum = []
for i in range(len(clusters)):
    m_sum.append(sum([matrix[k][i] for k in data_points]))
for i in range(len(clusters)):
    for j in range(i+1, len(clusters)):
        if i==j:
            continue
        tmp = (sum([matrix[k][i]*matrix[k][j] for k in data_points])*x[i]*x[j] <= threshhold*min(m_sum[i], m_sum[j]))
        m.addConstr(tmp, "C"+str(count))
        count += 1

# Optimize
m.optimize()
print("Optimized")

以上运行的结果和日志数据为：

Parameter outputflag unchanged
   Value: 1  Min: 0  Max: 1  Default: 1
Changed value of parameter method to 2
   Prev: -1  Min: -1  Max: 5  Default: -1
Optimize a model with 0 rows, 3 columns and 0 nonzeros
Model has 3 quadratic constraints
Variable types: 0 continuous, 3 integer (3 binary)
Coefficient statistics:
  Matrix range     [0e+00, 0e+00]
  QMatrix range    [1e+00, 4e+00]
  Objective range  [6e+00, 1e+01]
  Bounds range     [1e+00, 1e+00]
  RHS range        [0e+00, 0e+00]
  QRHS range       [1e+00, 2e+00]
Found heuristic solution: objective -0.0000000
Modified 2 Q diagonals
Modified 2 Q diagonals
Presolve time: 0.00s
Presolved: 0 rows, 3 columns, 0 nonzeros
Variable types: 0 continuous, 3 integer (3 binary)
Presolve removed 0 rows and 3 columns
Presolve: All rows and columns removed

Root relaxation: objective 2.200000e+01, 0 iterations, 0.00 seconds

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

*    0     0               0      12.0000000   12.00000  0.00%     -    0s

Explored 0 nodes (2 simplex iterations) in 0.01 seconds
Thread count was 32 (of 32 available processors)

Solution count 2: 12 -0 

Optimal solution found (tolerance 1.00e-04)
Best objective 1.200000000000e+01, best bound 1.200000000000e+01, gap 0.0000%
Optimized
Final Obj: 12.0
1
2

还有其他方法可以解决它，例如人工智能方法（爬山法，模拟退火法等），遗传算法等进化优化方法（您可以根据自己的问题修改后使用NSGA2，代码可在http://www.iitk.ac.in/kangal/codes.shtml)

选择具有特定阈值的重叠簇

Selecting Overlapping Clusters with certain threshold

optimization

cluster-analysis

nonlinear-optimization

threshold