Can you suggest a solution for analyzing big, relational data?
I'm looking for some advice on my requirements. Below is a description of them. Please feel free to contact me for any details. Even suggestions on how to describe my problem more clearly would be much appreciated :)
Requirement description
I have some data in the following format:
router, interface, timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bits
r1, 1, 1453016443, 10.0.0.1, 10.0.0.2, 100, 200, tcp, 108
r2, 1, 1453016448, 10.0.0.3, 10.0.0.8, 200, 200, udp, 100
As you can see, this is raw network flow data. I have omitted some columns just to keep it readable. The data volume is very large, and it is generated very fast: about 1 billion rows every 5 minutes...
What I want is to do some real-time analysis on this data.
For example:
Draw a line chart over the timestamps:
select timestamp, sum(bits) from raw_data where router = 'r1' and interface = 1 group by timestamp;
Find the top 3 src_ip that send the most data on one interface:
select src_ip, sum(bits) from raw_data where router = 'r1' and interface = 2 group by src_ip order by sum(bits) desc limit 3;
I have tried several solutions, but none of them fits well. For example:
RDBMS
MySQL looks good, except for a few problems:
- The data is too big.
- I have a lot more columns than I described here. To improve my query speed I would have to create indexes on most of the columns, but building indexes on such a big table, with indexes spanning so many columns, does not seem like a good idea, right? (See the sketch just below this list.)
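To make that concern concrete, here is a minimal sketch of the kind of composite indexes MySQL would need just to serve the two example queries above. The table and column names come from the sample data, the index names are made up for illustration, and every additional query pattern would need yet another index, which quickly becomes impractical at a billion new rows every 5 minutes.
-- hypothetical composite indexes for the two example queries;
-- each new query pattern would need another index like these
create index idx_router_iface_ts on raw_data (router, interface, timestamp);
create index idx_router_iface_src on raw_data (router, interface, src_ip);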
OpenTSDB
OpenTSDB is a good time series database, but it does not meet my requirements either.
OpenTSDB has trouble with TOP N problems. For my requirement "get the top 3 src_ip that send the most data", OpenTSDB cannot solve it.
Spark
I know that Apache Spark can be used like an RDBMS; it has a feature called Spark SQL. I have not tried it, but I guess the performance would not satisfy the real-time analysis/query requirement, right? After all, Spark is better suited to offline computation, right?
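For reference, the second example query translates to Spark SQL essentially unchanged. This is only a sketch under the assumption that the flow records have been loaded (for example from Parquet files) and registered as a view named raw_data; whether it is fast enough for near-real-time use would have to be measured.
-- hypothetical Spark SQL version of the top-3 query, assuming the
-- flow data is registered as a view named raw_data
select src_ip, sum(bits) as total_bits
from raw_data
where router = 'r1' and interface = 2
group by src_ip
order by total_bits desc
limit 3;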
Elasticsearch
I really had high hopes for ES when I learned about this project, but it is not suitable either. When you aggregate on more than one column you have to use so-called nested bucket aggregations in Elasticsearch, and the result of such an aggregation cannot be sorted. You have to retrieve all the results and sort them yourself. In my case the result set is far too large for that to be practical.
So... I'm stuck here. Can anyone offer some suggestions?
I don't see why ES can't meet your requirements. I think you misunderstood this part:
"But it is not suitable either. When you aggregate on more than one column you have to use so-called nested bucket aggregations in Elasticsearch, and the result of such an aggregation cannot be sorted."
Your first requirement (draw a line chart over the timestamps) can easily be achieved with a query/aggregation like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 1
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_minute": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m"
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
As for your second requirement (find which src_ip sends the most data on a given interface), it can also easily be achieved with a query/aggregation like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "field": "src_ip",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
UPDATE
As per your comment, your second requirement might change to finding the top 3 combinations of src_ip/dst_ip. This can be achieved with a terms aggregation that uses a script instead of a field: the script builds the src/dst combination, and the aggregation provides the sum of bits for each pair, like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "script": "[doc.src_ip.value, doc.dst_ip.value].join('-')",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
Note that in order to run that last query you need to enable dynamic scripting. Also, since you will have billions of documents, scripting might not be the best solution, but it is worth trying before digging further. Another possible solution is to add a combination field (src_ip-dst_ip) at indexing time, so that you can use it as a plain field in the terms aggregation without resorting to scripting.
You could try Axibase Time Series Database, which is non-relational but supports SQL queries in addition to a REST-like API. Here is a Top-N query example:
SELECT entity, avg(value) FROM cpu_busy
WHERE time between now - 1 * hour and now
GROUP BY entity
ORDER BY avg(value) DESC
LIMIT 3
https://axibase.com/docs/atsd/sql/#grouping
The ATSD Community Edition is free.
Disclosure: I work for Axibase.