Searching Firebase JSON with Apache Drill
I've exported part of my data from https://domain.firebaseio.com/users/:
{
"3": {
"company": "",
"d_year": "",
"email": "mario.giambanco@domain.com",
"facebook": "",
"fullname": "Mario Test",
"google": "",
"igoto": "",
"image": "",
"notifications": {
"-Jx6fpaJHvKPHc8CylPd": {
"from": "System",
"image": "/img/system_icon.jpg",
"msg": "System:",
"param": "3",
"posteddate": 1440016723546,
"type": "system"
}
},
"school": "",
"school_year": "",
"tags": {
"-JxWuEPs183UEwsI-XNb": {
"title": "Anesthesia"
},
"-JxWuZ-ePcx0XqYRmzc6": {
"title": "Bridges"
}
},
"twitter": ""
},
"4": {
"company": "",
"d_year": "",
"email": "mariogiambanco@domain.com",
"fullname": "mario test",
"igoto": "",
"image": "img/a0.jpg",
"notifications": {
"-JxAQpWGzY-gOzej7Xis": {
"from": "System",
"image": "/img/system_icon.jpg",
"msg": "System:",
"param": "4",
"posteddate": 1440079641420,
"type": "system"
}
},
"school": "",
"school_year": ""
}
}
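Running:
SELECT * FROM dfs.`/Users/me/Desktop/users.json`
works (or at least, I get results back).
But how do I map the columns to values in a row? Coming from the relational database world, the column headers in the screenshot are the unique IDs (3, 4); these should be part of the row, not column headers. The same goes for the unique keys generated when using push({}).
The goal, of course, is to be able to do a SELECT ... WHERE, for example: select * from data where fullname="Mario Test"
Is there some pre-processing I should do on the JSON before searching it with Drill?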
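If the keys "3" and "4" are actually IDs, the keys shouldn't really be holding values. A better way to format this JSON for Drill is to use actual keys for those values (also note that a single file can hold multiple records, and Drill can parse them):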
{ "id": 3,
"data": {
...
}
}
{ "id": 4,
"data": {
...
}
}
That way you can run a query like this:
> select t.`id`, t.`data`.`fullname` as `fullname` from `firebase.json` t;
+-----+-------------+
| id | fullname |
+-----+-------------+
| 3 | Mario Test |
| 4 | mario test |
+-----+-------------+
2 rows selected (0.269 seconds)
There may be another way to do this, but I would say yes, you'll probably want to transform the data a bit before querying it with Drill.
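For what it's worth, here is a minimal sketch of that transformation in Python. It assumes the export has the shape shown in the question (a single object keyed by user ID); the function names `to_records` and `to_json_lines` and the trimmed-down `sample` are made up for illustration.

```python
import json


def to_records(export):
    """Turn {"3": {...}, "4": {...}} into [{"id": 3, "data": {...}}, ...]."""
    records = []
    for key, value in export.items():
        # Firebase keys are strings. Numeric ones become ints; anything else
        # (for example push() keys) is kept as a string.
        record_id = int(key) if key.isascii() and key.isdigit() else key
        records.append({"id": record_id, "data": value})
    return records


def to_json_lines(records):
    """Serialize one JSON object per line, so one file holds many records."""
    return "\n".join(json.dumps(record) for record in records)


# Trimmed-down sample shaped like the export in the question (assumption:
# only two fields per user are kept here to keep the example short).
sample = {
    "3": {"fullname": "Mario Test", "email": "mario.giambanco@domain.com"},
    "4": {"fullname": "mario test", "email": "mariogiambanco@domain.com"},
}

records = to_records(sample)
print(to_json_lines(records))
```

To use it on the real export, load users.json with `json.load` and write the result of `to_json_lines` to the file you point Drill at (firebase.json in the query above).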
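This looks like a case where you'd want to use KVGEN. KVGEN would give you columns of the kind Chris Matta describes, but KVGEN operates on a column, and in this case there isn't really a column to use: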
0: jdbc:drill:zk=local> select t.* from dfs.`/Users/vince/data/Whosebug/users.json` t;
+---+---+
| 3 | 4 |
+---+---+
| {"company":"","d_year":"","email":"mario.giambanco@domain.com","facebook":"","fullname":"Mario Test","google":"","igoto":"","image":"","notifications":{"-Jx6fpaJHvKPHc8CylPd":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"3","posteddate":1440016723546,"type":"system"}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{"title":"Anesthesia"},"-JxWuZ-ePcx0XqYRmzc6":{"title":"Bridges"}},"twitter":""} | {"company":"","d_year":"","email":"mariogiambanco@domain.com","fullname":"mario test","igoto":"","image":"img/a0.jpg","notifications":{"-JxAQpWGzY-gOzej7Xis":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"4","posteddate":1440079641420,"type":"system"}},"school":"","school_year":""} |
+---+---+
1 row selected (0.133 seconds)
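Because these columns are dynamic and sit at the "top level" of the JSON object, you can't use KVGEN here. But if you transform the data just slightly, you can. I used an invocation of the most excellent tool jq to massage the data into a format KVGEN can work with: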
$ jq '.| { "user": . }' < users.json > users_kv.json
This takes the input and wraps the JSON object in another map, which gives us the "static" column we need in order to do the following:
0: jdbc:drill:zk=local> select kvgen(t.`user`) from dfs.`/Users/vince/data/Whosebug/users_kv.json` t;
+--------+
| EXPR$0 |
+--------+
| [{"key":"3","value":{"company":"","d_year":"","email":"mario.giambanco@domain.com","facebook":"","fullname":"Mario Test","google":"","igoto":"","image":"","notifications":{"-Jx6fpaJHvKPHc8CylPd":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"3","posteddate":1440016723546,"type":"system"},"-JxAQpWGzY-gOzej7Xis":{}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{"title":"Anesthesia"},"-JxWuZ-ePcx0XqYRmzc6":{"title":"Bridges"}},"twitter":""}},{"key":"4","value":{"company":"","d_year":"","email":"mariogiambanco@domain.com","fullname":"mario test","igoto":"","image":"img/a0.jpg","notifications":{"-Jx6fpaJHvKPHc8CylPd":{},"-JxAQpWGzY-gOzej7Xis":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"4","posteddate":1440079641420,"type":"system"}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{},"-JxWuZ-ePcx0XqYRmzc6":{}}}}] |
+--------+
1 row selected (1.774 seconds)
That still can't really be queried the way you want, because the column now holds a list. So use FLATTEN:
0: jdbc:drill:zk=local> select flatten(kvgen(t.`user`)) as `user` from dfs.`/Users/vince/data/Whosebug/users_kv.json` t;
+------+
| user |
+------+
| {"key":"3","value":{"company":"","d_year":"","email":"mario.giambanco@domain.com","facebook":"","fullname":"Mario Test","google":"","igoto":"","image":"","notifications":{"-Jx6fpaJHvKPHc8CylPd":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"3","posteddate":1440016723546,"type":"system"},"-JxAQpWGzY-gOzej7Xis":{}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{"title":"Anesthesia"},"-JxWuZ-ePcx0XqYRmzc6":{"title":"Bridges"}},"twitter":""}} |
| {"key":"4","value":{"company":"","d_year":"","email":"mariogiambanco@domain.com","fullname":"mario test","igoto":"","image":"img/a0.jpg","notifications":{"-Jx6fpaJHvKPHc8CylPd":{},"-JxAQpWGzY-gOzej7Xis":{"from":"System","image":"/img/system_icon.jpg","msg":"System:","param":"4","posteddate":1440079641420,"type":"system"}},"school":"","school_year":"","tags":{"-JxWuEPs183UEwsI-XNb":{},"-JxWuZ-ePcx0XqYRmzc6":{}}}} |
+------+
2 rows selected (0.257 seconds)
Two rows, much better. Now you're ready to do what you wanted (note the subquery, and the backticks around the reserved words user and value):
0: jdbc:drill:zk=local> select u.`user`.`key` as userid, u.`user`.`value`.fullname as fullname, u.`user`.`value`.email as email from (select flatten(kvgen(t.`user`)) as `user` from dfs.`/Users/vince/data/Whosebug/users_kv.json` t) u where u.`user`.`value`.fullname = 'Mario Test';
+---------+-------------+-----------------------------+
| userid | fullname | email |
+---------+-------------+-----------------------------+
| 3 | Mario Test | mario.giambanco@domain.com |
+---------+-------------+-----------------------------+
1 row selected (0.22 seconds)
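If it helps to see what the wrap, KVGEN and FLATTEN steps are doing, here is a rough Python analogue of the pipeline above. This is only a sketch for intuition, not how Drill implements these functions: `kvgen` and `flatten` are hypothetical helpers named after the Drill functions, and `sample` is a trimmed-down stand-in for users.json. Unlike the Drill output above, it does not add empty maps for keys that only exist on the other user.

```python
def kvgen(mapping):
    """Rough analogue of KVGEN: a map becomes a list of key/value pairs."""
    return [{"key": key, "value": value} for key, value in mapping.items()]


def flatten(rows, column):
    """Rough analogue of FLATTEN: emit one row per element of a list column."""
    for row in rows:
        for element in row[column]:
            yield {**row, column: element}


# Trimmed-down stand-in for users.json (assumption: two fields per user).
sample = {
    "3": {"fullname": "Mario Test", "email": "mario.giambanco@domain.com"},
    "4": {"fullname": "mario test", "email": "mariogiambanco@domain.com"},
}

# Same idea as the jq call: wrap the export under one static key, "user".
wrapped = {"user": sample}

# One row whose "user" column holds a list of key/value pairs.
rows = [{"user": kvgen(wrapped["user"])}]

# One row per user, each with a {"key": ..., "value": ...} map.
users = list(flatten(rows, "user"))

# The WHERE clause from the final query, applied to the flattened rows.
matches = [
    {
        "userid": u["user"]["key"],
        "fullname": u["user"]["value"].get("fullname"),
        "email": u["user"]["value"].get("email"),
    }
    for u in users
    if u["user"]["value"].get("fullname") == "Mario Test"
]
print(matches)
```

The point of the wrap is the same in both worlds: once the dynamic IDs live under a single known name, there is something fixed to call KVGEN on, and the IDs come back as data in the key field.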