ArangoDB Faceted Search Performance

Question

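We are currently evaluating ArangoDB's performance in the area of facet calculations. A number of other products can do the same thing, either through a special API or through a query language:

MarkLogic Facets
ElasticSearch Aggregations
Solr Faceting, etc.

We understand that Arango has no special API for calculating facets explicitly. In practice it is not needed: thanks to the comprehensive AQL, facets can easily be computed with a simple query such as: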

 FOR a in Asset 
  COLLECT attr = a.attribute1 INTO g
 RETURN { value: attr, count: length(g) }

This query calculates a facet on attribute1 and yields the frequencies in the following form:

[
  {
    "value": "test-attr1-1",
    "count": 2000000
  },
  {
    "value": "test-attr1-2",
    "count": 2000000
  },
  {
    "value": "test-attr1-3",
    "count": 3000000
  }
]

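That is, across my entire collection attribute1 takes three values (test-attr1-1, test-attr1-2, and test-attr1-3), each returned with its count. We are essentially running a DISTINCT query with aggregated counts.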

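It looks simple and clean. There is only one, but very big, issue: performance.

The query above runs for 31 seconds(!) on a test collection of only 8M documents. We tried different index types and storage engines (with and without RocksDB) and investigated the explain plans, to no avail. The test documents used here are very concise, with only three short attributes.

We would appreciate any input at this point. Either we are doing something wrong, or ArangoDB is simply not designed to perform in this particular area.

By the way, the ultimate goal is to run something like the following in under a second: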

LET docs = (FOR a IN Asset
  FILTER a.name like 'test-asset-%'
  SORT a.name
  RETURN a)

LET attribute1 = (
  FOR a in docs
    COLLECT attr = a.attribute1 INTO g
    RETURN { value: attr, count: length(g[*]) }
)

LET attribute2 = (
  FOR a in docs
    COLLECT attr = a.attribute2 INTO g
    RETURN { value: attr, count: length(g[*]) }
)

LET attribute3 = (
  FOR a in docs
    COLLECT attr = a.attribute3 INTO g
    RETURN { value: attr, count: length(g[*]) }
)

LET attribute4 = (
  FOR a in docs
    COLLECT attr = a.attribute4 INTO g
    RETURN { value: attr, count: length(g[*]) }
)

RETURN {
  counts: (RETURN {
    total: LENGTH(docs),
    offset: 2,
    to: 4,
    facets: {
      attribute1: {
        from: 0,
        to: 5,
        total: LENGTH(attribute1)
      },
      attribute2: {
        from: 5,
        to: 10,
        total: LENGTH(attribute2)
      },
      attribute3: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute3)
      },
      attribute4: {
        from: 0,
        to: 1000,
        total: LENGTH(attribute4)
      }
    }
  }),
  items: (FOR a IN docs LIMIT 2, 4 RETURN {id: a._id, name: a.name}),
  facets: {
    attribute1: (FOR a in attribute1 SORT a.count LIMIT 0, 5 return a),
    attribute2: (FOR a in attribute2 SORT a.value LIMIT 5, 10 return a),
    attribute3: (FOR a in attribute3 LIMIT 0, 1000 return a),
    attribute4: (FOR a in attribute4 SORT a.count, a.value LIMIT 0, 1000 return a)
  }
}

Thanks!

Answer 1

It turned out that the main discussion took place on the ArangoDB Google Group, where the full thread can be found.

Here is a summary of the current solution:

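Run a custom build of Arango from a specific feature branch in which a number of performance improvements have been made (hopefully they will make it into a main release soon)
No indexes are required for facet calculations
MMFiles is the preferred storage engine
AQL should be written to use "COLLECT attr = a.attributeX WITH COUNT INTO length" instead of "count: length(g)"
AQL should be split into smaller pieces and run in parallel (we are using Java 8's Fork/Join to spread out the facet AQLs and then join them into a final result)
One AQL to filter/sort and retrieve the main entity (if required; when sorting/filtering, add a corresponding skiplist index)
The rest are small AQLs, one per facet, returning value/frequency pairs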
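
To make the COLLECT and query-splitting points above concrete, here is a minimal sketch of what one of the small per-facet queries might look like. It reuses the Asset collection, the attribute1 field, and the name filter from the question; repeating the filter inside the facet query is an assumption about how the split was done, not something confirmed in the discussion.

```aql
// One facet per query: WITH COUNT INTO counts the members of each group
// during grouping, instead of collecting them into g and calling length(g).
FOR a IN Asset
  FILTER a.name LIKE 'test-asset-%'   // assumed: same filter as the main query
  COLLECT attr = a.attribute1 WITH COUNT INTO length
  RETURN { value: attr, count: length }
```

The main entities would then be fetched by a separate query, again only a sketch based on the question's original AQL:

```aql
// Main-entity query, run independently of the facet queries.
FOR a IN Asset
  FILTER a.name LIKE 'test-asset-%'
  SORT a.name
  LIMIT 2, 4
  RETURN { id: a._id, name: a.name }
```

Each facet query (attribute1 through attribute4) and the main-entity query can be submitted concurrently by the client, and the results merged into the final response there.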

In the end we gained a more than 10x performance improvement compared with the original AQL provided above.

Tags: facet, aggregation, faceted-search, arangodb