Not sure how to create ArangoDB graph using columns in existing collection

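Background

I have a RocksDB collection containing three fields: _id, author, subreddit.

Problem

I want to create an ArangoDB graph that connects these two existing columns. However, the examples and drivers seem to accept only collections as their edge definitions.

Issue

The ArangoDB documentation lacks information on how to create a graph whose edges and nodes are both drawn from the same collection.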

Edit:

Solution

This was resolved by a code change tracked in an ArangoDB issues ticket.

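For a graph, you need an edge collection for the edges and vertex collections for the nodes. You cannot create a graph using only one collection.

Perhaps this topic in the documentation will help you.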
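To make the vertex/edge split concrete, here is a minimal sketch in plain JavaScript (no ArangoDB calls) of how one source document could be broken into two vertex documents and one edge document. The helper splitRecord and the collection names "authors" and "subreddits" are assumptions made for illustration, and the names are used directly as _key values, which only works if they contain characters ArangoDB allows in document keys.

```javascript
// Sketch only: plain JavaScript, no ArangoDB calls.
// Assumption: author and subreddit names are valid as _key values as-is.
function splitRecord(rec) {
  const author = { _key: rec.author, author: rec.author };             // vertex 1
  const subreddit = { _key: rec.subreddit, subreddit: rec.subreddit }; // vertex 2
  // The edge refers to both vertices by "<collection>/<_key>".
  const edge = {
    _from: "authors/" + author._key,
    _to: "subreddits/" + subreddit._key,
  };
  return { author, subreddit, edge };
}

const parts = splitRecord({ _id: "test/1", author: "A", subreddit: "S1" });
console.log(JSON.stringify(parts.edge));
// {"_from":"authors/A","_to":"subreddits/S1"}
```

The two vertex documents would go into document collections and the edge document into an edge collection, which is why a single collection is not enough.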

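Here is an approach using jq, a JSON-oriented command-line tool.

First, an outline of the steps:

1) Use arangoexport to export your author/subredit collection to a file, say exported.json;

2) Run the jq script, nodes_and_edges.jq, shown below;

3) Use arangoimp to import the JSON produced in (2) into ArangoDB.

There are several ways to store graphs in ArangoDB, so ultimately you may want to adjust nodes_and_edges.jq accordingly (e.g. to generate the nodes first, and then the edges).

INDEX

If your jq does not define INDEX, then use this: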

def INDEX(stream; idx_expr):
  reduce stream as $row ({};
    .[$row|idx_expr|
      if type != "string" then tojson
      else .
      end] |= $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);

nodes_and_edges.jq

# This module is for generating JSON suitable for importing into ArangoDB.

### Generic Functions

# assign_keys/2
# Add a generated "_key" of the form prefix + (start + index) to each object
# in the input array. It is defined before nodes/2 because jq requires a
# function to be defined before it is used.
def assign_keys(prefix; start):
  . as $in
  | reduce range(0;length) as $i ([];
    . + [$in[$i] + {"_key": "\(prefix)\(start+$i)"}]);

# nodes/2
# $name must be the name of the ArangoDB collection of nodes corresponding to $key.
# The scheme for generating key names can be altered by changing the first
# argument of assign_keys, e.g. to "" if no prefix is wanted.
def nodes($key; $name):
  map( {($key): .[$key]} ) | assign_keys($name[0:1] + "_"; 1);

# nodes_and_edges facilitates the normalization of an implicit graph
# in an ArangoDB "document" collection of objects having $from and $to keys.
# The input should be an array of JSON objects, as produced 
# by arangoexport for a single collection.
# If $nodesq is truthy, then the JSON for both the nodes and edges is emitted,
# otherwise only the JSON for the edges is emitted.
# 
# The first four arguments should be strings.
# 
# $from and $to should be the key names in . to be used for the from-to edges;
# $name1 and $name2 should be the names of the corresponding collections of nodes.
def nodes_and_edges($from; $to; $name1; $name2; $nodesq ):
  def dict($s): INDEX(.[$s]) | map_values(._key);
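  # $k is the field name to use for the node value, so that the $to nodes
  # are not emitted under the $from field name.
  def objects($k): to_entries[] | {($k): .key, "_key": .value};
  (nodes($from; $name1) | dict($from)) as $fdict
  | (nodes($to; $name2) | dict($to)  ) as $tdict
  | (if $nodesq then ($fdict | objects($from)), ($tdict | objects($to))
     else empty end),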
    (.[] | {_from: "\($name1)/\($fdict[.[$from]])",
            _to:   "\($name2)/\($tdict[.[$to]])"} )  ;


### Problem-Specific Functions

# If you wish to generate the collections separately,
# then these will come in handy:
def authors: nodes("author"; "authors");
def subredits: nodes("subredit"; "subredits");

def nodes_and_edges:
  nodes_and_edges("author"; "subredit"; "authors"; "subredits"; true);

nodes_and_edges

Invocation

jq -cf nodes_and_edges.jq exported.json

This invocation produces one set of JSONL (JSON Lines) documents for "authors", one for "subredits", and one for the edge collection.
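For readers who would rather not use jq, the same normalization can be sketched in plain JavaScript. This is an illustration under stated assumptions, not a port of the script: nodesAndEdges is a hypothetical helper, the key scheme (first letter of the collection name, an underscore, then a counter) follows nodes/2, and unlike the jq version this sketch assigns one key per distinct value.

```javascript
// Sketch of the nodes-and-edges normalization in plain JavaScript.
// nodesAndEdges is a hypothetical helper, not part of any ArangoDB tool.
function nodesAndEdges(rows, from, to, name1, name2) {
  // Map each distinct field value to a generated key such as "a_1".
  const makeDict = (field, prefix) => {
    const dict = new Map();
    for (const row of rows) {
      if (!dict.has(row[field])) {
        dict.set(row[field], `${prefix}_${dict.size + 1}`);
      }
    }
    return dict;
  };
  const fdict = makeDict(from, name1[0]);
  const tdict = makeDict(to, name2[0]);
  // Turn a value-to-key dictionary into node documents.
  const asNodes = (dict, field) =>
    [...dict].map(([value, key]) => ({ [field]: value, _key: key }));
  // One edge per input row, referring to the generated node keys.
  const edges = rows.map((row) => ({
    _from: `${name1}/${fdict.get(row[from])}`,
    _to: `${name2}/${tdict.get(row[to])}`,
  }));
  return { fromNodes: asNodes(fdict, from), toNodes: asNodes(tdict, to), edges };
}

const rows = [
  { author: "A", subredit: "S1" },
  { author: "B", subredit: "S2" },
  { author: "A", subredit: "S2" },
];
const g = nodesAndEdges(rows, "author", "subredit", "authors", "subredits");
for (const doc of [...g.fromNodes, ...g.toNodes, ...g.edges]) {
  console.log(JSON.stringify(doc)); // one JSON document per line (JSONL)
}
```

Writing each of the three arrays to its own file would give one import file per collection.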

Example

exported.json
[
  {"_id":"test/115159","_key":"115159","_rev":"_V8JSdTS---","author": "A", "subredit": "S1"},
  {"_id":"test/145120","_key":"145120","_rev":"_V8ONdZa---","author": "B", "subredit": "S2"},
  {"_id":"test/114474","_key":"114474","_rev":"_V8JZJJS---","author": "C", "subredit": "S3"}
]

Output

{"author":"A","_key":"a_1"}
{"author":"B","_key":"a_2"}
{"author":"C","_key":"a_3"}
{"subredit":"S1","_key":"s_1"}
{"subredit":"S2","_key":"s_2"}
{"subredit":"S3","_key":"s_3"}
{"_from":"authors/a_1","_to":"subredits/s_1"}
{"_from":"authors/a_2","_to":"subredits/s_2"}
{"_from":"authors/a_3","_to":"subredits/s_3"}

Here is an AQL solution. It assumes, however, that all the referenced collections already exist and that UPSERT is not needed.

FOR v IN testcollection
  LET a = v.author
  LET s = v.subredit
  FILTER a
  FILTER s
  LET fid = (INSERT {author: a}   INTO authors RETURN NEW._id)[0]
  LET tid = (INSERT {subredit: s} INTO subredits RETURN NEW._id)[0]
  INSERT {_from: fid, _to: tid} INTO author_of
  RETURN [fid, tid]

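Note that the following queries take a while to complete on this huge dataset, but they should finish successfully after a few hours.

We start arangoimp to import our base dataset: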

arangoimp --create-collection true  --collection RawSubReddits --type jsonl ./RC_2017-01 

We use arangosh to create the collections in which our final data will live:

db._create("authors")
db._createEdgeCollection("authorsToSubreddits")

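We populate the authors collection by simply ignoring any duplicate authors that occur later on. We compute each author's _key with the MD5 function, so that it obeys the restrictions on the characters allowed in _key, and so that we can find it again later by calling MD5() on the author field: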

db._query(`
  FOR item IN RawSubReddits
    INSERT {
      _key: MD5(item.author),
      author: item.author
      } INTO authors
        OPTIONS { ignoreErrors: true }`);
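The idea behind hashing the author name can be tried outside the database. The following is a small sketch using Node's built-in crypto module; authorKey is a hypothetical helper, and it is an assumption that hashing the UTF-8 bytes to lowercase hex corresponds to what AQL's MD5() does.

```javascript
const crypto = require("crypto");

// Hypothetical helper mirroring the idea of MD5(item.author) in the query above.
// Assumption: the name is hashed as UTF-8 and rendered as lowercase hex.
function authorKey(author) {
  return crypto.createHash("md5").update(author, "utf8").digest("hex");
}

// A name containing characters that are not allowed in _key (space, slash).
const key = authorKey("some author / with spaces");
console.log(key.length);                 // 32
console.log(/^[0-9a-f]{32}$/.test(key)); // true: only hex digits remain
console.log(key === authorKey("some author / with spaces")); // true: deterministic
```

Because the hash is deterministic, the same key can be recomputed from the author name whenever an edge needs to point at that author.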

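After we have populated the second vertex collection (we keep the imported collection as the first vertex collection), we have to compute the edges. Since each author may have created several subreds, there will most probably be multiple edges originating from each author. As mentioned before, we can use the MD5() function again to reference the author created earlier: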

db._query(`
  FOR onesubred IN RawSubReddits
    INSERT {
      _from: CONCAT('authors/', MD5(onesubred.author)),
      _to: CONCAT('RawSubReddits/', onesubred._key)
    } INTO authorsToSubreddits`);
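The shape of the resulting edge documents can be sketched on a few in-memory records. The sample documents below are made up for illustration, and the md5 helper assumes that Node's MD5 over UTF-8 input matches AQL's MD5().

```javascript
const crypto = require("crypto");
// Assumption: this corresponds to AQL's MD5() for UTF-8 input.
const md5 = (s) => crypto.createHash("md5").update(s, "utf8").digest("hex");

// Hypothetical in-memory stand-in for a few RawSubReddits documents.
const raw = [
  { _key: "1", author: "alice", subreddit: "sewing" },
  { _key: "2", author: "alice", subreddit: "knitting" },
  { _key: "3", author: "bob", subreddit: "sewing" },
];

// One edge per raw document, from the author vertex to that document.
const edges = raw.map((doc) => ({
  _from: "authors/" + md5(doc.author),
  _to: "RawSubReddits/" + doc._key,
}));

// Count the edges leaving each author vertex.
const perAuthor = {};
for (const e of edges) {
  perAuthor[e._from] = (perAuthor[e._from] || 0) + 1;
}
console.log(Object.values(perAuthor)); // [ 2, 1 ]
```

Both of alice's documents yield the same _from value, which is what produces several edges per author.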

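After the edge collection has been filled (which may again take a while; we're talking about 40 million edges here, right?), we create the graph description: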

db._graphs.save({
  "_key": "reddits",
  "orphanCollections" : [ ],
  "edgeDefinitions" : [ 
    {
      "collection": "authorsToSubreddits",
      "from": ["authors"],
      "to": ["RawSubReddits"]
    }
  ]
})

We can now use the UI to browse the graph, or use AQL queries to explore it. Let's pick the first author from that list, more or less at random:

db._query(`for author IN authors LIMIT 1 RETURN author`).toArray()
[ 
  { 
    "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
    "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
    "_rev" : "_W_Eu-----_", 
    "author" : "punchyourbuns" 
  } 
]

We have identified an author, and now run a graph query for him:

db._query(`FOR vertex, edge, path IN 0..1
   OUTBOUND 'authors/1cec812d4e44b95e5a11f3cbb15f7980'
   GRAPH 'reddits'
   RETURN path`).toArray()
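To illustrate what a depth 0..1 OUTBOUND traversal yields, here is a rough in-memory sketch in plain JavaScript. The function traverse0to1 and the sample data are hypothetical and are not ArangoDB APIs; the sketch only mimics the shape of the returned paths.

```javascript
// Sketch only: a depth 0..1 OUTBOUND traversal over in-memory arrays.
function traverse0to1(startId, vertices, edges) {
  const byId = new Map(vertices.map((v) => [v._id, v]));
  const start = byId.get(startId);
  // Depth 0: the start vertex on its own, with no edges.
  const paths = [{ vertices: [start], edges: [] }];
  // Depth 1: one path per edge leaving the start vertex.
  for (const e of edges) {
    if (e._from === startId) {
      paths.push({ vertices: [start, byId.get(e._to)], edges: [e] });
    }
  }
  return paths;
}

// Made-up sample data with the same collection names as above.
const vertices = [
  { _id: "authors/k1", author: "alice" },
  { _id: "RawSubReddits/1", subreddit: "sewing" },
  { _id: "RawSubReddits/2", subreddit: "knitting" },
];
const edges = [
  { _from: "authors/k1", _to: "RawSubReddits/1" },
  { _from: "authors/k1", _to: "RawSubReddits/2" },
];
const paths = traverse0to1("authors/k1", vertices, edges);
console.log(paths.length); // 3: the author alone, plus one path per edge
```

Each depth-1 path holds the author vertex, one edge, and the document that edge points to.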

One of the resulting paths looks like this:

{ 
  "edges" : [ 
    { 
      "_key" : "128327199", 
      "_id" : "authorsToSubreddits/128327199", 
      "_from" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_to" : "RawSubReddits/38026350", 
      "_rev" : "_W_LOxgm--F" 
    } 
  ], 
  "vertices" : [ 
    { 
      "_key" : "1cec812d4e44b95e5a11f3cbb15f7980", 
      "_id" : "authors/1cec812d4e44b95e5a11f3cbb15f7980", 
      "_rev" : "_W_HAL-y--_", 
      "author" : "punchyourbuns" 
    }, 
    { 
      "_key" : "38026350", 
      "_id" : "RawSubReddits/38026350", 
      "_rev" : "_W-JS0na--b", 
      "distinguished" : null, 
      "created_utc" : 1484537478, 
      "id" : "dchfe6e", 
      "edited" : false, 
      "parent_id" : "t1_dch51v3", 
      "body" : "I don't understand tension at all.\nMine is set to auto.\nI'll replace the needle and rethread. Thanks!", 
      "stickied" : false, 
      "gilded" : 0, 
      "subreddit" : "sewing", 
      "author" : "punchyourbuns", 
      "score" : 3, 
      "link_id" : "t3_5o66d0", 
      "author_flair_text" : null, 
      "author_flair_css_class" : null, 
      "controversiality" : 0, 
      "retrieved_on" : 1486085797, 
      "subreddit_id" : "t5_2sczp" 
    } 
  ] 
}