Clustering of a list of strings of letters according to 2 (and ideally generalized to n) arbitrary grouping rules?

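I want to sort a set of variable-length strings of letters into n groups, according to whether each string contains any/all/none of the letters in n given sets.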

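For example, here I try to sort all combinations of the letters 'A,B,P,Q,X' into 2 groups with the following rules: group 1 must include all/any of 'A,P' (but not 'B,Q'), and group 2 must include all/any of 'B,Q' (but not 'A,P'). My final goal is to build a list in which the groups are as far apart as possible (e.g. at the start and the end), with the strings that contain no member of either group in the middle, followed by the strings that contain members of both groups as a mix between the middle and the extremes. The ideal order would be: all-1/none-2, some-1/none-2, all-1/some-2, none-1-2/some-1-2, all-2/some-1, some-2/none-1, all-2/none-1.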

labels_powerset = ['A','B','P','Q','X',
    'AB','AP','AQ','AX','BP','BQ','BX','PQ','PX','QX',
    'ABP','ABQ','ABX','APQ','APX','AQX','BPQ','BPX','BQX','PQX',
    'ABPQ','ABPX','ABQX','APQX','BPQX','ABPQX']

labels_for_order = []

for length in range(1,len(max(labels_powerset,key=len))+1):
    order = [label for label in labels_powerset if len(label)==length]
    labels_for_order.append(order)

group1 = ['A','P']
group2 = ['B','Q']

all1 = [y for y in[[label for label in order if all(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

any1 = [y for y in[[label for label in order if any(x in label for x in group1) and not all(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

all2 = [y for y in[[label for label in order if all(x in label for x in group2) and not any(y in label for y in group1)]
        for order in labels_for_order]if y]

any2 = [y for y in[[label for label in order if any(x in label for x in group2) and not all(x in label for x in group2) and not any(y in label for y in group1)]
        for order in labels_for_order]if y]

none = [y for y in[[label for label in order if not any(x in label for x in group1) and not any(y in label for y in group2)]
        for order in labels_for_order]if y]

both = [y for y in[[label for label in order if any(x in label for x in group1) and any(y in label for y in group2)]
        for order in labels_for_order]if y]

both1 = [both[x] for x in range(0,int(len(both)/2))]

both2 = [both[x] for x in range(int(len(both)/2),len(both))]

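# Flatten one level: each group above is a list of per-length lists of labels.
sorted_labels = [label for order in any1+all1+both1+none+both2+all2+any2 for label in order]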

The objective is to have a list that is as symmetric as possible in terms of both membership and string length.

I am quite new to coding and have read a bit about k-means, but I don't know how to apply it to strings of letters.

How can I do this more efficiently, and in a way that generalizes to n groups/rules?

K-means is meant for multivariate continuous data, and clustering does not try to create balanced groups.

What you should consider instead is sorting.

Define a scoring function. For example, give +1 for each "good" letter and -1 for each "bad" letter, and add ±100 for pure ones.

Then sort the words by this score.
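A minimal sketch of that idea for the two-group example, under some assumptions of mine: group 1 letters count as "good", group 2 letters as "bad", and "pure" means a string contains letters from only one of the two groups. The function name `score`, the `pure_bonus` parameter, and the tie-breaking by length and then alphabetically are illustrative choices, not part of the original suggestion.

```python
from itertools import combinations


def score(label, good, bad, pure_bonus=100):
    """Score a string: +1 per good letter, -1 per bad letter,
    plus or minus pure_bonus if only one group is present (assumed meaning of "pure")."""
    n_good = sum(ch in good for ch in label)
    n_bad = sum(ch in bad for ch in label)
    total = n_good - n_bad
    if n_good and not n_bad:
        total += pure_bonus   # only group 1 letters (plus neutral ones)
    elif n_bad and not n_good:
        total -= pure_bonus   # only group 2 letters (plus neutral ones)
    return total


letters = "ABPQX"
# All non-empty combinations of the letters, as in the question.
labels_powerset = ["".join(combo)
                   for r in range(1, len(letters) + 1)
                   for combo in combinations(letters, r)]

group1 = set("AP")  # treated as "good" letters
group2 = set("BQ")  # treated as "bad" letters

# Highest score first; ties are broken by length, then alphabetically.
sorted_labels = sorted(
    labels_powerset,
    key=lambda s: (-score(s, group1, group2), len(s), s),
)
print(sorted_labels)
```

With this key, the pure group 1 strings land at the start, the pure group 2 strings at the end, and neutral or evenly mixed strings (score 0) in the middle. Extending this to n groups would need a different score, for example one weight per group, which is not covered by this sketch.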