大 Neo4j 图上的缓慢聚合

Slow aggregation on big neo4j graph

配置:
Windows 8.1
neo4j-enterprise-2.2.0-M03
缓存类型:hpc
8Gb 内存
6Gb 用于 JVM 堆 (wrapper.java.initmemory=6144 wrapper.java.maxmemory=6144)
6Gb JVM 堆中的 5Gb 用于映射内存 (dbms.pagecache.memory=5G)

型号:
模型表示用户如何浏览网站。
27 522 896 个节点 (394Mb)
111 294 796 个关系 (3609Mb)
33 906 363 个属性 (1326Mb)

293 (:Page) 个节点
27522603 (:PageView) 个节点
0 (:User) 个节点(尚未加载)
与(:Page)节点相连的每个(:PageView)节点
每个(:PageView)节点与下一个(:PageView)节点相连
每个 (:PageView) 节点与 (:User) 节点连接(尚未)

查询

match (:Page {Name:'#########.aspx'})<-[:At]-(:PageView)-[:Next]->(:PageView)-[:At]->(p:Page)
return p.Name,count(*) as count
order by count desc
limit 10;

个人资料信息:

+------------------------------------------------+
| p.Name                               | count   |
+------------------------------------------------+
| "#####################.aspx"         | 5172680 |
| "###############.aspx"               | 3846455 |
| "#########.aspx"                     | 3579022 |
| "###########.aspx"                   | 3051043 |
| "#############################.aspx" | 1713004 |
| "############.aspx"                  | 1373928 |
| "############.aspx"                  | 1338063 |
| "#####.aspx"                         | 1285447 |
| "###################.aspx"           | 884077  |
| "##############.aspx"                | 759665  |
+------------------------------------------------+
10 rows
195363 ms

Compiler CYPHER 2.2

Planner COST

Projection(0)
  |
  +Top
    |
    +EagerAggregation
      |
      +Projection(1)
        |
        +Filter(0)
          |
          +Expand(All)(0)
            |
            +Filter(1)
              |
              +Expand(All)(1)
                |
                +Filter(2)
                  |
                  +Expand(All)(2)
                    |
                    +NodeUniqueIndexSeek

+---------------------+---------------+----------+----------+-------------------------------------------+--------------------------------------------------+
|            Operator | EstimatedRows |     Rows |   DbHits |                               Identifiers |                                            Other |
+---------------------+---------------+----------+----------+-------------------------------------------+--------------------------------------------------+
|       Projection(0) |           881 |       10 |        0 |   FRESHID105,   FRESHID110, count, p.Name |                                    p.Name, count |
|                 Top |           881 |       10 |        0 |                  FRESHID105,   FRESHID110 |                                   {  AUTOINT1};  |
|    EagerAggregation |           881 |      173 |        0 |                  FRESHID105,   FRESHID110 |                                                  |
|       Projection(1) |        776404 | 35941815 | 71883630 |                             FRESHID105, p |                                                  |
|           Filter(0) |        776404 | 35941815 | 35941815 |                                         p | (NOT(anon[38] == anon[78]) AND hasLabel(p:Page)) |
|      Expand(All)(0) |        776404 | 35941815 | 49287436 |                                         p |                                    ()-[:At]->(p) |
|           Filter(1) |        384001 | 13345621 | 13345621 |                                           |                      hasLabel(anon[67]:PageView) |
|      Expand(All)(1) |        384001 | 13345621 | 19478500 |                                           |                                   ()-[:Next]->() |
|           Filter(2) |        189923 |  6132879 |  6132879 |                                           |                      hasLabel(anon[46]:PageView) |
|      Expand(All)(2) |        189923 |  6132879 |  6132880 |                                           |                                     ()<-[:At]-() |
| NodeUniqueIndexSeek |             1 |        1 |        1 |                                           |                                      :Page(Name) |
+---------------------+---------------+----------+----------+-------------------------------------------+--------------------------------------------------+

Total database accesses: 202202762

没有不必要标签的查询

match (:Page {Name:'Dashboard.aspx'})<-[:At]-()-[:Next]->()-[:At]->(p)
return p.Name,count(*) as count
order by count desc
limit 10;

个人资料信息:

+------------------------------------------------+
| p.Name                               | count   |
+------------------------------------------------+
| "#####################.aspx"         | 5172680 |
| "###############.aspx"               | 3846455 |
| "#########.aspx"                     | 3579022 |
| "###########.aspx"                   | 3051043 |
| "#############################.aspx" | 1713004 |
| "############.aspx"                  | 1373928 |
| "############.aspx"                  | 1338063 |
| "#####.aspx"                         | 1285447 |
| "###################.aspx"           | 884077  |
| "##############.aspx"                | 759665  |
+------------------------------------------------+
10 rows
166751 ms

Compiler CYPHER 2.2

Planner COST

Projection(0)
  |
  +Top
    |
    +EagerAggregation
      |
      +Projection(1)
        |
        +Filter
          |
          +Expand(All)(0)
            |
            +Expand(All)(1)
              |
              +Expand(All)(2)
                |
                +NodeUniqueIndexSeek

+---------------------+---------------+----------+----------+-----------------------------------------+---------------------------+
|            Operator | EstimatedRows |     Rows |   DbHits |                             Identifiers |                     Other |
+---------------------+---------------+----------+----------+-----------------------------------------+---------------------------+
|       Projection(0) |           881 |       10 |        0 |   FRESHID82,   FRESHID87, count, p.Name |             p.Name, count |
|                 Top |           881 |       10 |        0 |                  FRESHID82,   FRESHID87 |            {  AUTOINT1};  |
|    EagerAggregation |           881 |      173 |        0 |                  FRESHID82,   FRESHID87 |                           |
|       Projection(1) |        776388 | 35941815 | 71883630 |                            FRESHID82, p |                           |
|              Filter |        776388 | 35941815 |        0 |                                       p | NOT(anon[38] == anon[60]) |
|      Expand(All)(0) |        776388 | 35941815 | 49287436 |                                       p |             ()-[:At]->(p) |
|      Expand(All)(1) |        383997 | 13345621 | 19478500 |                                         |            ()-[:Next]->() |
|      Expand(All)(2) |        189923 |  6132879 |  6132880 |                                         |              ()<-[:At]-() |
| NodeUniqueIndexSeek |             1 |        1 |        1 |                                         |               :Page(Name) |
+---------------------+---------------+----------+----------+-----------------------------------------+---------------------------+

Total database accesses: 146782447

Message.log
问题
如何更快地执行此查询? (更多 RAM、重构查询、分布式缓存、使用另一个 language/shell/method、...)

更新:
回答中最后一个查询的个人资料信息

neo4j-sh (?)$ profile match (:Page {Name:'Dashboard.aspx'})<-[:At]-()-[:Next]->()-[:At]->(p)
with p,count(*) as count
order by count desc
limit 10 return p.Name, count;
+------------------------------------------------+
| p.Name                               | count   |
+------------------------------------------------+
| "OutgoingDocumentsList.aspx"         | 5172680 |
| "DocumentPreview.aspx"               | 3846455 |
| "Dashboard.aspx"                     | 3579022 |
| "ActualTasks.aspx"                   | 3051043 |
| "DocumentFillMissingRequisites.aspx" | 1713004 |
| "EditDocument.aspx"                  | 1373928 |
| "PaymentsList.aspx"                  | 1338063 |
| "Login.aspx"                         | 1285447 |
| "ReportingRequisites.aspx"           | 884077  |
| "ContractorInfo.aspx"                | 759665  |
+------------------------------------------------+
10 rows
151328 ms

Compiler CYPHER 2.2

Planner COST

Projection
  |
  +Top
    |
    +EagerAggregation
      |
      +Filter
        |
        +Expand(All)(0)
          |
          +Expand(All)(1)
            |
            +Expand(All)(2)
              |
              +NodeUniqueIndexSeek

+---------------------+---------------+----------+----------+------------------+---------------------------+
|            Operator | EstimatedRows |     Rows |   DbHits |      Identifiers |                     Other |
+---------------------+---------------+----------+----------+------------------+---------------------------+
|          Projection |           881 |       10 |       20 | count, p, p.Name |             p.Name, count |
|                 Top |           881 |       10 |        0 |         count, p |       {  AUTOINT1}; count |
|    EagerAggregation |           881 |      173 |        0 |         count, p |                         p |
|              Filter |        776388 | 35941815 |        0 |                p | NOT(anon[38] == anon[60]) |
|      Expand(All)(0) |        776388 | 35941815 | 49287436 |                p |             ()-[:At]->(p) |
|      Expand(All)(1) |        383997 | 13345621 | 19478500 |                  |            ()-[:Next]->() |
|      Expand(All)(2) |        189923 |  6132879 |  6132880 |                  |              ()<-[:At]-() |
| NodeUniqueIndexSeek |             1 |        1 |        1 |                  |               :Page(Name) |
+---------------------+---------------+----------+----------+------------------+---------------------------+

Total database accesses: 74898837

正如我之前提到的,在你的另一个问题中,如果你能写一个 Java based server extension 你就可以很容易地做到。

// initialize counters
Map<Node,AtomicInteger> pageCounts = new HashMap<>(300);
for (Node page : graphDb.findNode(Page)) pageCounts.put(page,new AtomicInteger());

// find start page
Label Page = DynamicLabel.label("Page");
Node page = graphDB.findNode(Page,"Name",pageName).iterator().next();

// follow page-view relationships
for (Relationship at : page.getRelationships(At, INCOMING)) {
   // follow singular next relationship
   Relationship at2 = at.getStartNode().getSingleRelationship(Next,OUTGOING);
   if (at2==null) continue;
   // follow singular page-view relationship to end-page
   Node page2 = at2.getSingleRelationship(At,OUTGOING).getEndNode();
   // increment counter
   pageCounts.get(page2).incrementAndGet();
}

// sort pages by count descending
List pages = new ArrayList(pageCounts.entrySet())
Collections.sort(pages,new Comparator<Map.Entry<Node,Integer>>() {
    public int compare(Map.Entry<Node,Integer> e1, Map.Entry<Node,Integer> e2) {
       return - Integer.compare(e1.getValue(),e2.getValue());
    }
});
// return top 10
return pages.subList(0,10);

对于 Cypher,我会尝试这样的操作:

match (:Page {Name:'#########.aspx'})<-[:At]-(pv:PageView)
WITH distinct pv 
MATCH (pv)-[:Next]->(pv2:PageView)
with distinct pv2
match (pv2)-[:At]->(p:Page)
return p.Name,count(*) as count
order by count desc
limit 10;

更新

我为它写了一个测试 运行 它在我更大的 linux 机器上,结果更合理:在 1.6s 之间Cypher 中的 Java 和 5s max

这是代码和结果:https://gist.github.com/jexp/94f75ddb849f8c41c97c

在 Cypher 中:

-------------------
match (:Page {Name:'Page1'})<-[:At]-()-[:Next]->()-[:At]->(p)
return p.Name,count(*) as count
order by count desc
limit 10;
+-------------------+
| p.Name    | count |
+-------------------+
| "Page169" | 975   |
| "Page125" | 959   |
| "Page106" | 955   |
| "Page274" | 951   |
| "Page176" | 947   |
| "Page241" | 944   |
| "Page30"  | 942   |
| "Page44"  | 938   |
| "Page1"   | 938   |
| "Page118" | 938   |
+-------------------+
10 rows

in 3212 ms
[Compiler CYPHER 2.2

Planner COST

+---------------------+---------------+--------+--------+--------------------------+---------------------------+
|            Operator | EstimatedRows |   Rows | DbHits |              Identifiers |                     Other |
+---------------------+---------------+--------+--------+--------------------------+---------------------------+
|                 Top |           488 |     10 |      0 |   FRESHID71,   FRESHID76 |            {  AUTOINT1};  |
|    EagerAggregation |           488 |    300 |      0 |   FRESHID71,   FRESHID76 |                           |
|          Projection |        238460 | 264828 | 529656 |             FRESHID71, p |                           |
|              Filter |        238460 | 264828 |      0 |                        p | NOT(anon[29] == anon[51]) |
|      Expand(All)(0) |        238460 | 264828 | 529656 |                        p |             ()-[:At]->(p) |
|      Expand(All)(1) |        238460 | 264828 | 778522 |                          |            ()-[:Next]->() |
|      Expand(All)(2) |        476922 | 513694 | 513695 |                          |              ()<-[:At]-() |
| NodeUniqueIndexSeek |             1 |      1 |      1 |                          |               :Page(Name) |
+---------------------+---------------+--------+--------+--------------------------+---------------------------+

Total database accesses: 2351530]

在Java中:

-------------------
Java took 1618 ms
Node[169]=975
Node[125]=959
Node[106]=955
Node[274]=951
Node[176]=947
Node[241]=944
Node[30]=942
Node[1]=938
Node[44]=938
Node[118]=938

你还可以做一些事情来加速你的 Cypher 查询,就是只在节点上聚合,并且只有 return 最后 10 行的 page.Name 属性,很多更快。

match (:Page {Name:'Page1'})<-[:At]-()-[:Next]->()-[:At]->(p)
with p,count(*) as count
order by count desc
limit 10 return p.Name, count