Neo4J - Finding the widest path on very large graphs
I have created a very large directed weighted graph, and I am trying to find the widest path between two points.
Each edge has a count property.
Here is a small portion of the graph:
I found this example and modified the query so that the path matching would be directed, like so:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, EXTRACT(c IN RELATIONSHIPS(p) | c.count) AS counts
UNWIND(counts) AS b
WITH p, MIN(b) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1
This query seems to require a lot of memory, and fails even on a subset of the data.
Update: To clarify, the query keeps running until it runs out of memory.
I found this link, which combines the use of Spark and Neo4j. Unfortunately, Neo4j's Mazerunner does not support a "widest path" algorithm out of the box. What is the right way to run a "widest path" query on a very large graph?
The reason your algorithm is taking so long to run is that (a) you have a big graph, (b) your memory parameters probably need tuning (see the comments), and (c) you are enumerating every possible path between ENTRY
and EXIT
. Depending on the structure of your graph, this could be a very large number of paths.
Note that since you are looking for the widest path, the width of a path is determined solely by its smallest edge weight (its bottleneck), and you want the path whose smallest weight is largest. This means you are likely computing and re-computing many paths that could have been discarded early.
Wikipedia has good information on this algorithm that you should consider. In particular:
It is possible to find maximum-capacity paths and minimax paths with a
single source and single destination very efficiently even in models
of computation that allow only comparisons of the input graph's edge
weights and not arithmetic on them.[12][18] The algorithm maintains a
set S of edges that are known to contain the bottleneck edge of the
optimal path; initially, S is just the set of all m edges of the
graph. At each iteration of the algorithm, it splits S into an ordered
sequence of subsets S1, S2, ... of approximately equal size; the
number of subsets in this partition is chosen in such a way that all
of the split points between subsets can be found by repeated
median-finding in time O(m). The algorithm then reweights each edge of
the graph by the index of the subset containing the edge, and uses the
modified Dijkstra algorithm on the reweighted graph; based on the
results of this computation, it can determine in linear time which of
the subsets contains the bottleneck edge weight. It then replaces S by
the subset Si that it has determined to contain the bottleneck weight,
and starts the next iteration with this new set S. The number of
subsets into which S can be split increases exponentially with each
step, so the number of iterations is proportional to the iterated
logarithm function, O(logn), and the total time is O(m logn).[18] In
a model of computation where each edge weight is a machine integer,
the use of repeated bisection in this algorithm can be replaced by a
list-splitting technique of Han & Thorup (2002), allowing S to be
split into O(√m) smaller sets Si in a single step and leading to a
linear overall time bound.
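The full list-splitting algorithm in the quote is involved, but its core building block, the "modified Dijkstra" that maximizes the minimum edge weight along a path, is simple to implement. Below is a minimal Python sketch (not Cypher) of that building block, assuming the graph fits in memory as a directed adjacency dict of (neighbor, count) pairs with positive weights; node names and weights are made up for illustration:

```python
import heapq

def widest_path(graph, source, target):
    """Widest (maximum-bottleneck) path via a Dijkstra variant that
    maximizes the minimum edge weight along the path.

    graph: dict mapping node -> list of (neighbor, weight) directed edges.
    Returns (bottleneck_width, path) or (0, []) if target is unreachable.
    """
    best = {source: float('inf')}   # best bottleneck found so far per node
    prev = {}                       # predecessor on the widest path
    heap = [(-float('inf'), source)]  # max-heap via negated widths

    while heap:
        neg_width, u = heapq.heappop(heap)
        width = -neg_width
        if u == target:
            # Reconstruct the path by walking predecessors backwards.
            path = [u]
            while u in prev:
                u = prev[u]
                path.append(u)
            return width, path[::-1]
        if width < best.get(u, 0):
            continue  # stale heap entry; a wider route to u was found
        for v, w in graph.get(u, []):
            cand = min(width, w)  # bottleneck of the path through u
            if cand > best.get(v, 0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    return 0, []
```

Unlike the "enumerate all paths" query, this visits each edge a bounded number of times instead of once per path containing it.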
You should consider implementing this method in place of your current "enumerate all paths" Cypher approach, because the "enumerate all paths" method has you re-checking the same edge's count once for every path that passes through that particular edge.
There is no off-the-shelf software that will do this for you. I would suggest reading that paragraph (and checking its citations for further information) and then implementing it yourself. I think you can do much better than your current query in terms of performance.
A couple of thoughts.
Your query (and the original sample query) can be simplified. This may or may not be enough to resolve your memory problem.
For every matched path, there is no need to: (a) create a collection of counts, (b) UNWIND it into rows, and then (c) perform a MIN aggregation. The same result can be obtained with the REDUCE function:
MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
ORDER BY count DESC
RETURN NODES(p) AS `Widest Path`, count
LIMIT 1;
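To see why the REDUCE version is lighter, here is the same computation in plain Python (the count values are made up for illustration): the original pipeline materializes the full list of counts before aggregating, while REDUCE folds them into a single running minimum with no intermediate rows:

```python
counts = [7, 3, 9, 5]  # hypothetical c.count values along one path

# What EXTRACT + UNWIND + MIN does: build the full list, then aggregate.
extracted = [c for c in counts]
bottleneck_unwind = min(extracted)

# What REDUCE does: one accumulator, updated per element.
bottleneck_reduce = 2147483647  # same sentinel as in the Cypher query
for c in counts:
    bottleneck_reduce = c if c < bottleneck_reduce else bottleneck_reduce

assert bottleneck_unwind == bottleneck_reduce == 3
```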
(I am assuming that the count
property value is an integer. 2147483647
is the maximum 32-bit integer value.)
You should also create an index (or, more appropriately, a uniqueness constraint) on the name
property of the Vertex
label. For example:
CREATE INDEX ON :Vertex(name)
Edited
The following enhanced version of the above query may solve your memory problem:
MERGE (t:Temp) SET t.count = 0, t.widest_path = NULL
WITH t
OPTIONAL MATCH p = (v1:Vertex {name:'ENTRY'})-[:TRAVELED*]->(v2:Vertex {name:'EXIT'})
WITH t, p, REDUCE(m = 2147483647, c IN RELATIONSHIPS(p) | CASE WHEN c.count < m THEN c.count ELSE m END) AS count
WHERE count > t.count
SET t.count = count, t.widest_path = NODES(p)
WITH COLLECT(DISTINCT t)[0] AS t
WITH t, t.count AS count, t.widest_path AS `Widest Path`
DELETE t
RETURN `Widest Path`, count;
It creates (and ultimately deletes) a temporary :Temp
node to keep track of the current "winning" count (and the corresponding path nodes). (You must make sure the Temp
label is not otherwise used.)
The WITH
clause that starts with COLLECT(DISTINCT t)
uses aggregation over the distinct :Temp
nodes (of which there is only 1) to ensure that Cypher keeps only a single reference to the :Temp
node, no matter how many paths satisfy the WHERE
clause. Also, that WITH
clause does not include p
, so Cypher does not accumulate paths we do not care about. It is this clause that is probably most important for avoiding your memory problem.
I have not actually tried this out.
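The memory-saving idea behind the :Temp node can be sketched in a few lines of Python: stream candidate paths one at a time and keep only the current best, instead of holding every path until the final sort. This is an illustrative sketch, with made-up node names and counts, of the bookkeeping the query performs:

```python
def best_path(paths):
    """Keep only the current widest path while streaming candidates,
    mirroring what the :Temp node does in the Cypher query.

    paths: iterable of (node_list, edge_counts) pairs.
    """
    best_count, best_nodes = 0, None
    for nodes, counts in paths:
        bottleneck = min(counts)          # the REDUCE step
        if bottleneck > best_count:       # the WHERE count > t.count test
            best_count, best_nodes = bottleneck, nodes
    return best_nodes, best_count
```

Because only one (count, nodes) pair is ever retained, memory stays constant in the number of paths examined.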