Clustering of a list of strings of letters according to 2 (and ideally generalized to n) arbitrary grouping rules?
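I want to sort a set of variable-length strings of letters into n groups, according to whether each string contains any/all/none of the letters from n given sets.
For example, here I try to sort all combinations of the letters 'A,B,P,Q,X' into 2 groups with the following rules: group 1 must include all/any of 'A,P' (but not 'B,Q'), and group 2 must include all/any of 'B,Q' (but not 'A,P'). My end goal is to build a list where the groups are as far apart as possible (e.g. at the start and the end), with the strings that contain no member of either group in the middle, followed by the strings that contain members of both groups and the hybrids between the middle and the extremes. The ideal order would be: all-1/none-2, some-1/none-2, all-1/some-2, none-1-2/some-1-2, all-2/some-1, some-2/none-1, all-2/none-1.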
from itertools import chain

def flatten(nested):
    # Helper not shown in the original post; assumed to flatten one level
    # (a list of lists of labels -> a flat list of labels).
    return list(chain.from_iterable(nested))

labels_powerset = ['A','B','P','Q','X',
                   'AB','AP','AQ','AX','BP','BQ','BX','PQ','PX','QX',
                   'ABP','ABQ','ABX','APQ','APX','AQX','BPQ','BPX','BQX','PQX',
                   'ABPQ','ABPX','ABQX','APQX','BPQX','ABPQX']
# Bucket the labels by string length (1, 2, ..., longest).
labels_for_order = []
for length in range(1, len(max(labels_powerset, key=len)) + 1):
    order = [label for label in labels_powerset if len(label) == length]
    labels_for_order.append(order)
group1 = ['A','P']
group2 = ['B','Q']
# Each list below holds one sub-list per string length; empty sub-lists are dropped.
# all1: every letter of group1, no letter of group2.
all1 = [y for y in [[label for label in order if all(x in label for x in group1) and not any(y in label for y in group2)]
                    for order in labels_for_order] if y]
# any1: some but not all letters of group1, no letter of group2.
any1 = [y for y in [[label for label in order if any(x in label for x in group1) and not all(x in label for x in group1) and not any(y in label for y in group2)]
                    for order in labels_for_order] if y]
# all2: every letter of group2, no letter of group1.
all2 = [y for y in [[label for label in order if all(x in label for x in group2) and not any(y in label for y in group1)]
                    for order in labels_for_order] if y]
# any2: some but not all letters of group2, no letter of group1.
any2 = [y for y in [[label for label in order if any(x in label for x in group2) and not all(x in label for x in group2) and not any(y in label for y in group1)]
                    for order in labels_for_order] if y]
# none: no letter of either group.
none = [y for y in [[label for label in order if not any(x in label for x in group1) and not any(y in label for y in group2)]
                    for order in labels_for_order] if y]
# both: at least one letter of each group.
both = [y for y in [[label for label in order if any(x in label for x in group1) and any(y in label for y in group2)]
                    for order in labels_for_order] if y]
# Split the mixed labels into two halves placed on either side of the middle.
both1 = both[:len(both) // 2]
both2 = both[len(both) // 2:]
sorted_labels = flatten(any1 + all1 + both1 + none + both2 + all2 + any2)
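The objective is to have a list that is as symmetric as possible in terms of membership and string length.
I'm pretty new to coding and have read a bit about k-means, but I don't know how to apply it to strings of letters.
How can I do this more efficiently, and in a way that generalizes to n groups/rules?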
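K-means is meant for multivariate continuous data, and clustering does not try to create balanced groups.
What you should look at instead is sorting.
Define a scoring function. For example, give +1 for each "good" letter and -1 for each "bad" letter, and add +100 or -100 for strings that are pure (containing letters from only one of the two groups).
Then sort the words by this score.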
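A minimal sketch of that scoring idea, assuming the "good" letters are group 1 ('A','P') and the "bad" letters are group 2 ('B','Q'). The names `score`, `sort_by_score` and `pure_bonus` are made up for illustration, and the tie-break on length and then alphabetical order is an extra assumption, not part of the suggestion above:

```python
def score(label, good, bad, pure_bonus=100):
    """+1 per letter from `good`, -1 per letter from `bad`,
    plus or minus `pure_bonus` when only one of the two groups appears."""
    n_good = sum(ch in good for ch in label)
    n_bad = sum(ch in bad for ch in label)
    total = n_good - n_bad
    if n_good and not n_bad:
        total += pure_bonus      # pure "good" string
    elif n_bad and not n_good:
        total -= pure_bonus      # pure "bad" string
    return total


def sort_by_score(labels, good, bad):
    # Highest score first; ties broken by length, then alphabetically
    # (assumed tie-break, chosen only to make the order deterministic).
    return sorted(labels, key=lambda s: (-score(s, good, bad), len(s), s))


group1 = {'A', 'P'}   # "good" letters
group2 = {'B', 'Q'}   # "bad" letters
labels = ['A', 'B', 'X', 'AP', 'BQ', 'AB', 'APX', 'ABP']
print(sort_by_score(labels, group1, group2))
# ['AP', 'APX', 'A', 'ABP', 'X', 'AB', 'B', 'BQ']
```

Pure group-1 strings land at the front, pure group-2 strings at the back, and mixed or neutral strings fall in the middle, which is roughly the layout asked for. The weights (1 and 100) are arbitrary and would need tuning to reproduce the exact "ideal order" from the question.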