Sorting a list to maximize similarity between neighboring items
I have a list of keys:
['A', 'B', 'C']
For each of these keys there is a list of properties:
{
'A': [2,3],
'B': [1,2],
'C': [4]
}
I would like to sort the list of labels such that neighboring labels share as many properties as possible.
In the example above, A and B share the relation 2, so they should be next to each other, while C shares nothing with either of them and can go anywhere.
The possible orders for this example would therefore be as follows:
["A","B","C"] # acceptable
["A","C","B"] # NOT acceptable
["B","A","C"] # acceptable
["B","C","A"] # NOT acceptable
["C","A","B"] # acceptable
["C","B","A"] # acceptable
Buckets
Actually I would prefer to represent them by putting them into "buckets":
[["A", "B"], ["C"]] # this can represent all four possible orders above.
However, this becomes problematic when a label belongs in two different buckets:
{
'A': [2,3],
'B': [1,2],
'C': [1,4]
}
How would I represent that? I could write:
[["A", "B"], ["C", "B"]]
But then I would need another processing step to convert the list of buckets into the final representation:
["A", "B", "C"]
In the above there could be recursively nested buckets:
[[["A","B"], ["C"]], ["D"]]
Which then might overlap:
[[["A","B"], ["C"]], ["A","D"]]
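For illustration, one possible post-processing step (a sketch of my own; the function name and the first-occurrence rule are assumptions, not something given in the question) walks the nested buckets depth-first and keeps the first occurrence of each label:

# Hypothetical sketch: flatten (possibly nested, possibly overlapping) buckets
# into one final ordering, keeping the first occurrence of each label.
def flatten_buckets(buckets, seen=None):
    if seen is None:
        seen = set()
    order = []
    for item in buckets:
        if isinstance(item, list):
            order.extend(flatten_buckets(item, seen))
        elif item not in seen:
            seen.add(item)
            order.append(item)
    return order

# flatten_buckets([["A", "B"], ["C", "B"]])         -> ['A', 'B', 'C']
# flatten_buckets([[["A","B"], ["C"]], ["A","D"]])  -> ['A', 'B', 'C', 'D']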
Quality
"closeness"即一个解的质量定义为邻居关系的交集之和(质量越高越好):
def measurequality(result, mapping):
    lastKey = None
    quality = 0
    for key in result:
        if lastKey is None:
            lastKey = key
            continue
        quality += len(set(mapping[key]).intersection(mapping[lastKey]))
        lastKey = key
    return quality
# Example determining that the solution ['A', 'B', 'C'] has quality 1:
#measurequality(['A', 'B', 'C'],
# {
# 'A': [2,3],
# 'B': [1,2],
# 'C': [4]
# })
Brute Forcing
Brute forcing is not a solution (in practice the list contains on the order of several thousand elements; though, if somebody has a brute-force approach better than O(n²)...).
However, it is possible to use brute forcing to create additional test cases:
- Generate a list L of n items ['A','B','C',...]
- For each item, generate a dictionary of relations R (up to n random numbers between 0 and n should be sufficient).
- Generate all possible permutations of L, feed them together with R into measurequality(), and keep those with the maximum return value (possibly not unique).
Code for creating random test cases to test an implementation:
import string
import random

def randomtestcase(n):
    keys = list(string.ascii_uppercase[0:n])
    minq = 0
    maxq = 0
    while minq == maxq:
        items = {}
        for key in keys:
            items[key] = random.sample(range(1,10), int(random.random()*10))
        minq = n*n
        minl = list(keys)
        maxq = 0
        maxl = list(keys)
        for _ in range(0, 1000): # TODO: explicitly construct all possible permutations of keys.
            random.shuffle(keys)
            q = measurequality(keys, items)
            if q < minq:
                minq = q
                minl = list(keys)
            if maxq < q:
                maxq = q
                maxl = list(keys)
    # returns the relations, a low-quality ordering as the starting point,
    # and the best quality found as the target
    return ( items, minl, maxq )

( items, keys, quality ) = randomtestcase(5)

sortedkeys = dosomething( keys, items )
actualquality = measurequality( sortedkeys, items )
if actualquality < quality:
    print('Suboptimal: quality {0} < {1}'.format(actualquality, quality))
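The TODO above (enumerating all permutations instead of sampling 1000 shuffles) can be resolved exactly for small n; a minimal sketch of my own, assuming the measurequality() defined earlier:

import itertools

def bruteforce(keys, items):
    # exhaustive search over all n! orderings -- only feasible for small n
    best_quality = -1
    best_orders = []
    for perm in itertools.permutations(keys):
        q = measurequality(list(perm), items)
        if q > best_quality:
            best_quality, best_orders = q, [perm]
        elif q == best_quality:
            best_orders.append(perm)
    return best_quality, best_orders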
Attempts
许多 "solutions" 中的一个不起作用(非常糟糕,这个没有初始元素的选择/在我在其他结果列表中添加和附加到结果列表之间的选择) :
def dosomething(keys, items):
    result = []
    todo = list(keys)
    result.append(todo.pop())
    while any(todo):
        lastItems = set(items[result[-1]])
        bestScore = None
        bestKey = None
        for key in todo:
            # note: 'score' is a set here, so the '<' below is a subset test
            # rather than a comparison of intersection sizes
            score = set(items[key]).intersection(lastItems)
            if bestScore is None or bestScore < score:
                bestScore = score
                bestKey = key
        todo.remove(bestKey)
        result.append(bestKey)
    return result
Examples
(Also check out the example generator in the brute forcing section above.)
Test code trying some examples:
def test(description, acceptable, keys, arguments):
    actual = dosomething(keys, arguments)
    if "".join(actual) in acceptable:
        return 0
    print("\n[{0}] {1}".format("".join(keys), description))
    print("Expected: {0}\nBut was: {1}".format(acceptable, actual))
    print("Quality of result: {0}".format(measurequality(actual, arguments)))
    print("Quality of expected: {0}".format([measurequality(a, arguments) for a in acceptable]))
    return 1
print("EXAMPLES")
failures = 0
# Need to try each possible ordering of letters to ensure that the order of keys
# wasn't accidentally already a valid ordering.
for keys in [
    ["A","B","C"],
    ["A","C","B"],
    ["B","A","C"],
    ["B","C","A"],
    ["C","A","B"],
    ["C","B","A"]
]:
    failures += test(
        "1. A and B both have 2, C doesn't, so C can go first or last but not in between.",
        ["ABC", "BAC", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [4]
        })
    failures += test(
        "2. They all have 2, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [2]
        })
    failures += test(
        "3. A and B share 2, B and C share 1, so B must be in the middle.",
        ["ABC", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1]
        })
    failures += test(
        "4. Each shares something with each other, creating a cycle, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1,3]
        })

if 0 < failures:
    print("{0} FAILURES".format(failures))
Precedences
As asked: the numbers used for the relations carry no order of precedence. An order of precedence exists, but it is a partial order and not the order of the numbers. I just didn't mention it because it makes the problem harder.
So given this example:
{
'A': [2,3],
'B': [1,2],
'C': [4]
}
it might be replaced by the following (using letters instead of numbers and adding precedence information):
{
'A': [('Q',7),('R',5)],
'B': [('P',6),('Q',6)],
'C': [('S',5)]
}
Note that
- The precedence only has meaning within one list, not across lists.
- The precedence of a shared relation may differ between the two lists.
- The same precedence may occur several times within one list.
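A minimal sketch of how the quality measure might read this new representation (an assumption of my own: since the partial order is unspecified, precedence is ignored here and only the relation letters are intersected):

def measurequality_named(result, mapping):
    # count shared relation names between neighbours, ignoring precedence
    quality = 0
    for left, right in zip(result, result[1:]):
        left_names = {name for name, _prec in mapping[left]}
        right_names = {name for name, _prec in mapping[right]}
        quality += len(left_names & right_names)
    return quality

# measurequality_named(['A', 'B', 'C'],
#                      {'A': [('Q',7),('R',5)],
#                       'B': [('P',6),('Q',6)],
#                       'C': [('S',5)]})   # -> 1 ('Q' is shared by A and B)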
I was tinkering with your test program and came up with this solution, which gives me 0 failures. Though it feels like a heuristic, it certainly needs more tests and test cases. The function assumes that keys are unique (so no ['A', 'A', 'B', ...] lists) and that all elements are present in the arguments dictionary:
def dosomething(_, arguments):
    # invert the mapping: relation -> list of keys that carry it
    m = {}
    for k, v in arguments.items():
        for i in v:
            m.setdefault(i, []).append(k)
    # emit the keys of the most widely shared relations first, so that keys
    # sharing a relation come out adjacent to each other
    out, seen = [], set()
    for _, values in sorted(m.items(), key=lambda k: -len(k[1])):
        for v in values:
            if v not in seen:
                out.append(v)
                seen.add(v)
    return out
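A quick sanity check of my own (not part of the answer), run against the question's first example:

mapping = {'A': [2, 3], 'B': [1, 2], 'C': [4]}
order = dosomething(None, mapping)
print(order)                           # ['A', 'B', 'C'] -- relation 2 groups A and B first
print(measurequality(order, mapping))  # 1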
This is a Travelling Salesman Problem, a notoriously difficult problem to solve optimally. The code provided solves for 10,000 nodes with simple interconnections (i.e. one or two relations each) in about 15 minutes. It performs less well for sequences that are more richly interconnected. This is explored in the test results below.
The idea of precedence, mentioned by the OP, is not explored.
The code provided consists of a heuristic solution, a brute-force solution that is optimal but not practical for large node_sets, and some simple but scalable test data generators, some with known optimal solutions. The heuristic finds optimal solutions for the OP's 'ABC' example, my own 8-item node_set, and for the scalable test data for which optimal solutions are known.
If the performance isn't good enough, at least it's a first effort and provides the beginnings of a test 'workshop' in which to improve the heuristic.
Test Results
>>> python sortbylinks.py
===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1
===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6
Short, relatively trivial cases appear to be no problem.
===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
"Quality <1000, cycling subsequence" (sequence length 501)
很有趣。通过将具有 {0, 1}
关系集的节点分组,质量得分几乎可以翻倍。启发式算法会找到这个最佳序列。质量 1000 不太可能,因为这些双 linked 组需要经常通过单 linked 节点相互连接(例如 {'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}
)。
这个测试数据的性能仍然很好,每个节点的关系很少。
"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
Part of the difficulty of the Travelling Salesman Problem (TSP) is knowing when you have an optimal solution. The heuristic doesn't seem to converge any faster even when starting from a near-optimal or optimal order.
===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000
When the number of relations is small, performance is good even with lots of nodes, and the result is likely to be close to optimal.
===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?
===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?
This is more like the data generated by the OP, and also more like the classic Travelling Salesman Problem (TSP), in which there is a set of distances between every pair of cities (read 'node' for 'city') and nodes are usually richly connected to each other. In this case the links between nodes are partial -- there is no guarantee of a link between any 2 nodes.
The heuristic's time performance is much worse in cases like this. There are 0 to n random relations for each of the n nodes. This likely means many more swap combinations that yield improved quality, the swaps and quality checks are themselves more expensive, and more passes are needed before the heuristic converges on its best result. In the worst case this may mean O(n^3).
Performance degrades as the number of nodes and relations grows (note the difference between n=200 -- 3 minutes -- and n=500 -- 70 minutes). So currently the heuristic may not be practical for several thousand highly interconnected nodes.
In addition, the quality of this test's result can't be known precisely because a brute-force solution isn't computationally feasible. That said, 6861 / 200 = 34.3 and 41999 / 500 = 84.0 average connections between pairs of nodes doesn't look too far off.
Heuristic and Brute-Force Solver Code
import sys
from collections import deque
import itertools
import random
import datetime

# TODO: in-place swapping? (avoid creating copies of sequences)

def timing(f):
    """
    decorator for displaying execution time for a method
    """
    def wrap(*args):
        start = datetime.datetime.now()
        f_return_value = f(*args)
        end = datetime.datetime.now()
        print('"{:s}" function took {:.1f}s'.format(f.__name__, (end - start).seconds))
        return f_return_value
    return wrap

def flatten(a):
    # e.g. [a, [b, c], d] -> [a, b, c, d]
    return itertools.chain.from_iterable(a)

class LinkAnalysis:
    def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
        """
        :param node_set: node_ids and their relation sets to be arranged in sequence
        :param max_ram: estimated maximum RAM to use
        :param generate_seeds: if true, attempt to generate some initial candidates based on sorting
        """
        self.node_set = node_set
        self.candidates = {}
        self.generate_seeds = generate_seeds
        self.seeds = {}
        self.relations = []
        # balance performance and RAM using regular 'weeding'
        candidate_size = sys.getsizeof(deque(self.node_set.keys()))
        self.weed_interval = max_ram // candidate_size

    def create_initial_candidates(self):
        print('Working...')
        self.generate_seed_from_presented_sequence()
        if self.generate_seeds:
            self.generate_seed_candidates()

    def generate_seed_from_presented_sequence(self):
        """
        add the initially presented order of nodes as one of the seed candidates
        this is worthwhile because the initial order may be close to optimal
        """
        presented_sequence = self.presented_sequence()
        self.seeds[tuple(presented_sequence)] = self.quality(presented_sequence)

    def presented_sequence(self) -> list:
        return list(self.node_set.keys())  # relies on Python 3.6+ to preserve key order in dicts

    def generate_seed_candidates(self):
        initial_sequence = self.presented_sequence()
        # get relations from the original data structure
        relations = sorted(set(flatten(self.node_set.values())))
        # sort by lowest precedence relation first
        print('...choosing seed candidates')
        for relation in reversed(relations):
            # use true-false ordering: in Python, True > False
            initial_sequence.sort(key=lambda sortkey: relation not in self.node_set[sortkey])
            sq = self.quality(initial_sequence)
            self.seeds[tuple(initial_sequence)] = sq

    def quality(self, sequence):
        """
        calculate the quality of a full sequence
        :param sequence:
        :return: quality score (int)
        """
        pairs = zip(sequence[:-1], sequence[1:])
        scores = [len(self.node_set[a].intersection(self.node_set[b]))
                  for a, b in pairs]
        return sum(scores)

    def brute_force_candidates(self, sequence):
        for sequence in itertools.permutations(sequence):
            yield sequence, self.quality(sequence)

    def heuristic_candidates(self, seed_sequence):
        # look for solutions with higher quality scores by swapping elements
        # start with large distances between elements
        # then reduce by powers of 2 until swapping next-door neighbours
        max_distance = len(seed_sequence) // 2
        max_pow2 = int(pow(max_distance, 1/2))
        distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
        for distance in distances:
            yield from self.seed_and_variations(seed_sequence, distance)
        # the seed candidate may be better than its derived sequences -- include it as a candidate
        yield seed_sequence, self.quality(seed_sequence)

    def seed_and_variations(self, seed_sequence, distance=1):
        # swap elements at a distance, starting from the beginning and end of the
        # sequence in seed_sequence
        candidate_count = 0
        for pos1 in range(len(seed_sequence) - distance):
            pos2 = pos1 + distance
            q = self.quality(seed_sequence)
            # from the beginning of the sequence
            yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
            # from the end of the sequence
            yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
            candidate_count += 2
            if candidate_count > self.weed_interval:
                self.weed()
                candidate_count = 0

    def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
        """
        swap and return quality (which can easily be calculated from the pre-swap quality)
        :param sequence: as for swap
        :param pos1: as for swap
        :param pos2: as for swap
        :param preswap_sequence_q: quality of the pre-swap sequence
        :return: swapped sequence, quality of the swapped sequence
        """
        initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
        swapped_sequence = self.swap(sequence, pos1, pos2)
        swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
        qdelta = swapped_node_q - initial_node_q
        swapped_sequence_q = preswap_sequence_q + qdelta
        return swapped_sequence, swapped_sequence_q

    def swap(self, sequence, node_pos1: int, node_pos2: int):
        """
        deques perform better than lists for swapping elements in a long sequence
        :param sequence: sequence on which to perform the element swap
        :param node_pos1: zero-based position of the first element
        :param node_pos2: zero-based position of the second element
        >>> swap(('A', 'B', 'C'), 0, 1)
        ('B', 'A', 'C')
        """
        if type(sequence) is tuple:
            # sequence is a candidate (candidates are dict keys and hence tuples)
            # and needs converting to a list for swap processing
            sequence = list(sequence)
        if node_pos1 == node_pos2:
            return sequence
        tmp = sequence[node_pos1]
        sequence[node_pos1] = sequence[node_pos2]
        sequence[node_pos2] = tmp
        return sequence

    def node_quality(self, sequence, pos):
        if pos < 0:
            pos = len(sequence) + pos
        no_of_links = 0
        middle_node_relations = self.node_set[sequence[pos]]
        if pos > 0:
            left_node_relations = self.node_set[sequence[pos - 1]]
            no_of_links += len(left_node_relations.intersection(middle_node_relations))
        if pos < len(sequence) - 1:
            right_node_relations = self.node_set[sequence[pos + 1]]
            no_of_links += len(middle_node_relations.intersection(right_node_relations))
        return no_of_links

    @timing
    def optimise_order(self, selection_strategy):
        top_score = 0
        new_top_score = True
        self.candidates.update(self.seeds)
        while new_top_score:
            top_score = max(self.candidates.values())
            new_top_score = False
            initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
            for initial_candidate in initial_candidates:
                for candidate, q in selection_strategy(initial_candidate):
                    if q > top_score:
                        new_top_score = True
                        top_score = q
                        self.candidates[tuple(candidate)] = q
            self.weed()
        print(f"Number of candidates with top score: {len(list(self.candidates))}")
        print(f"{list(self.candidates)[:3]}, ...and more")
        print(f"Top score: {top_score}")

    def weed(self):
        # retain only top-scoring candidates
        top_score = max(self.candidates.values())
        low_scorers = {k for k, v in self.candidates.items() if v < top_score}
        for low_scorer in low_scorers:
            del self.candidates[low_scorer]
Code Glossary
- node_set: a set of labelled nodes of the form 'unique_node_id': {relation1, relation2, ..., relationN}. The set of relations for each node can contain no relations or an arbitrary number of them.
- node: a key-value pair consisting of a node_id (key) and a set of relations (value)
- relation: as used by the OP, this is a number. If two nodes both share relation 1 and they are neighbours in the sequence, it adds 1 to the quality of the sequence.
- sequence: an ordered set of node ids (e.g. ['A', 'B', 'C']) that is associated with a quality score. The quality score is the sum of shared relations between neighbouring nodes in the sequence. The output of the heuristic is the sequence or sequences with the highest quality score.
- candidate: a sequence that is currently being investigated to see if it is of high quality.
Method
1. Generate seed sequences by a stable sort on the presence of each relation in the linked items
2. The initially presented order is also one of the seed sequences, in case it is close to optimal
3. For each seed sequence, pairs of nodes are swapped looking for a higher quality score
4. Execute a "round" for each seed sequence. A round is a shellsort-like pass over the sequence, swapping pairs of nodes, at first at a distance, then narrowing the distance until it is 1 (swapping immediate neighbours). Keep only those sequences whose quality is higher than the current top quality score
5. If a new top quality score was found in this round, weed out all but the top-scoring candidates and repeat step 4 using the top scorers as seeds. Otherwise exit.
Tests and Test Data Generators
The heuristic has been tested with small node_sets, scaled data of a few hundred up to 10,000 nodes with very simple relations, and a randomised, highly interconnected node_set generator more like the OP's test data. Perfectly singly-linked sequences, linked cycles (small subsequences that link within themselves and to each other) and shuffling were useful for flushing out and fixing weaknesses.
ABC_links = {
    'A': {2, 3},
    'B': {1, 2},
    'C': {4}
}

nick_links = {
    'B': {1, 2, 4},
    'C': {4},
    'A': {2, 3},
    'D': {4},
    'E': {3},
    'F': {5, 6},
    'G': {2, 4},
    'H': {1},
}

unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)

def many_nodes_many_relations(node_count):
    # data set with n nodes and random 0..n relations as per OP's requirement
    relation_range = list(range(node_count))
    relation_set = (
        set(random.choices(relation_range, k=random.randint(0, node_count)))
        for _ in range(sys.maxsize)
    )
    return scaled_data(node_count, relation_set)

def scaled_data(no_of_items, link_sequence_model):
    uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # unique labels using sequences of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
    item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
    # only use a copy of the original link sequence model -- otherwise the model could be
    # exhausted or start mid-cycle
    link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
    return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}

def shuffled(items_list):
    """relies on Python 3.6+ dictionary insertion-ordered keys"""
    shuffled_keys = list(items_list.keys())
    random.shuffle(shuffled_keys)
    return {k: items_list[k] for k in shuffled_keys}

cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)

linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)

large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)

tests = [
    ('ABC', 1, ABC_links, True),
    ('Nick 8-digit', 6, nick_links, True),
    # ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
    # ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
    ('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
    # ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
    # ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
    # ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
    # ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]

for title, expected_quality, item_links, generate_seeds in tests:
    la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
    seq_length = len(list(la.node_set.keys()))
    print()
    print('===============================')
    print(f'"{title}" (sequence length {seq_length})')
    print('===============================')
    la.create_initial_candidates()
    print('Finding heuristic candidates...')
    la.optimise_order(la.heuristic_candidates)
    print(f'Optimal quality: {expected_quality}')
    # print('Brute Force working...')
    # la.optimise_order(la.brute_force_candidates)
Performance
The heuristic is more 'practical' than the brute-force solution because it leaves out many possible combinations. It is possible that a lower-quality sequence produced by one element swap is actually only one swap away from an even higher-quality score, but such a case might be weeded out before it can be tested.
The heuristic appears to find optimal results for singly-linked sequences and for cyclic sequences linked head to tail. These have a known optimal solution, which the heuristic finds, but they may be less complex and problematic than real data.
A big improvement was the introduction of an "incremental" quality calculation that can quickly compute the quality difference made by a two-element swap without recomputing the quality score of the entire sequence.
EDIT: Misread the quality function; this maps to separable Travelling Salesman Problems
With N nodes, P properties, and T total properties across all nodes, this should be solvable in O(N + P + T) or better, depending on the topology of the data.
Let's convert your problem to a graph, where the "distance" between any two nodes is -(number of shared properties). Nodes with no connection are left unlinked. This will take at least O(T) to create the graph, and possibly another O(N + P) to subdivide it.
Your "sort order" then converts to a "path" through the nodes. In particular, you want the shortest path.
Additionally, you will be able to apply several translations to improve the performance and usability of generic algorithms:
- Subdivide the graph into disconnected chunks and solve each chunk independently
- Renormalize all values to start at 1..N instead of -N..-1 (per graph; it doesn't really matter, since |number of properties| can be added instead).
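A sketch of my own of the graph-construction and subdivision steps (function names are assumptions; note that materialising all pairwise edges costs on the order of the sum of squared property-group sizes, which exceeds O(T) for widely shared properties):

from collections import defaultdict
from itertools import combinations

def build_graph(properties):
    # invert the mapping: property -> nodes carrying it (O(T))
    by_property = defaultdict(list)
    for node, props in properties.items():
        for p in props:
            by_property[p].append(node)
    # edge weight = -(number of shared properties) for each connected node pair
    weights = defaultdict(int)
    for nodes in by_property.values():
        for a, b in combinations(nodes, 2):
            weights[frozenset((a, b))] -= 1
    return weights

def components(nodes, weights):
    # split the graph into disconnected chunks to be solved independently
    adjacency = defaultdict(set)
    for pair in weights:
        a, b = tuple(pair)
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, chunks = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, chunk = [start], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            chunk.append(n)
            stack.extend(adjacency[n] - seen)
        chunks.append(chunk)
    return chunks

# build_graph({'A': [2, 3], 'B': [1, 2], 'C': [4]}) -> {frozenset({'A', 'B'}): -1}
# components(['A', 'B', 'C'], ...)                  -> [['A', 'B'], ['C']]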
https://en.wikipedia.org/wiki/Component_(graph_theory)#Algorithms
It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph).
https://en.wikipedia.org/wiki/Shortest_path_problem#Undirected_graphs
Weights  Time complexity   Author
ℝ+       O(V²)             Dijkstra 1959
ℝ+       O((E + V) log V)  Johnson 1977 (binary heap)
ℝ+       O(E + V log V)    Fredman & Tarjan 1984 (Fibonacci heap)
ℕ        O(E)              Thorup 1999 (requires constant-time multiplication)