Query 100 GB of S3 data in milliseconds

Question

I have JSON data in S3. The data looks like this:

{
    "act_timestamp": 1576480759864,
    "action": 26,
    "cmd_line": "\\??\\C:\\Windows\\system32\\conhost.exe 0xffffffff",
    "guid": "45af94911fb911ea827300270e098ff0",
    "md5": "d5669294f78a7d48c318ef22d5685ba7",
    "name": "conhost.exe",
    "path": "C:\\Windows\\System32\\conhost.exe",
    "pid": 1968,
    "sha2": "6bd1f5ab9250206ab3836529299055e272ecaa35a72cbd0230cb20ff1cc30902",
    "proc_id": "45af94901fb911ea827300270e098ff0",
    "proc_name": "gcxvdf.exe"
}

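I have about 100 GB of JSON like this stored in S3, in a folder structure like year/month/day/hour. I need to query this data and get results within milliseconds. The queries can look like this: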

select proc_id where name='conhost.exe',
select proc_id where cmd_line contains 'conhost.exe'.

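I tried AWS Athena and Redshift, but both returned results in around 10-20 seconds. I even tried the Parquet and ORC file formats.

Is there any tool, technology, or technique that can query this kind of data and return results in milliseconds?

(The response time needs to be in milliseconds because I am developing an interactive application.)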

Answer 1

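I think you are looking for a distributed search system like Solr or Elasticsearch (I'm sure there are others, but those are the ones I'm familiar with).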
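To make the search-system suggestion concrete, here is a minimal sketch of how the two example queries from the question might be expressed as Elasticsearch Query DSL request bodies. It only builds the bodies and does not contact a cluster. The helper names are hypothetical, and the sketch assumes `name` and `cmd_line` are mapped as `keyword` (or, for the substring case, `wildcard`) fields; the mapping is not something the question specifies.

```python
import json


def exact_match_query(field, value, return_field="proc_id"):
    """Build a search body for an exact match (assumes `field` is a keyword field)."""
    return {
        "_source": [return_field],  # return only the field the question selects
        "query": {"term": {field: {"value": value}}},
    }


def contains_query(field, fragment, return_field="proc_id"):
    """Build a search body for a substring match.

    Assumes `field` is a keyword or wildcard field. `fragment` is inserted
    as-is, so any '*' or '?' it contains would act as a wildcard.
    """
    return {
        "_source": [return_field],
        "query": {"wildcard": {field: {"value": f"*{fragment}*"}}},
    }


if __name__ == "__main__":
    # select proc_id where name='conhost.exe'
    print(json.dumps(exact_match_query("name", "conhost.exe"), indent=2))
    # select proc_id where cmd_line contains 'conhost.exe'
    print(json.dumps(contains_query("cmd_line", "conhost.exe"), indent=2))
```

Patterns with a leading wildcard can be slow on large indexes, so the "contains" case may need a different mapping (for example an n-gram analyzer) to stay within a millisecond budget; that would have to be measured on the real data.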

It's also worth considering whether you can reduce the size of the data at all. Is any of your 100 GB old or out of date?
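One way to shrink the data a query has to touch, given the year/month/day/hour layout described in the question, is to read only the prefixes that cover the time window of interest. The sketch below generates those prefixes. The `data` root, the zero-padded folder names, and the use of UTC are all assumptions, since the question does not show the exact key format.

```python
from datetime import datetime, timedelta, timezone


def hour_prefix(moment, root="data"):
    """Format one hour as a key prefix (assumed layout: root/YYYY/MM/DD/HH/)."""
    return f"{root}/{moment:%Y/%m/%d/%H}/"


def hourly_prefixes(start, end, root="data"):
    """Return one prefix per hour that overlaps the half-open window [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(hour_prefix(current, root))
        current += timedelta(hours=1)
    return prefixes


def prefix_for_timestamp_ms(act_timestamp_ms, root="data"):
    """Map an epoch-milliseconds value such as act_timestamp to its hour prefix (UTC assumed)."""
    moment = datetime.fromtimestamp(act_timestamp_ms / 1000, tz=timezone.utc)
    return hour_prefix(moment, root)


if __name__ == "__main__":
    window = hourly_prefixes(datetime(2019, 12, 16, 7, 30), datetime(2019, 12, 16, 10, 0))
    print(window)
    print(prefix_for_timestamp_ms(1576480759864))
```

The same idea carries over to whichever engine is used: if old hours are rarely queried, only the recent prefixes need to be loaded or indexed.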

Answer 2

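I was able to solve the above use case by using Presto and Hive on AWS EMR.

With Hive, we can create a table over the data in S3, and by using Presto with Hive as the catalog, we can query that data. I found Presto on EMR to be much faster than AWS Athena (which is odd, since Athena uses Presto internally).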

 Create the table in Hive (replace each datatype with the column's Hive type):
        CREATE EXTERNAL TABLE `test_table` (
          `field_name1` datatype,
          `field_name2` datatype,
          `field_name3` datatype
        )
        STORED AS ORC
        LOCATION 's3://test_data/data/';

 Query this table in Presto (assuming it was created in Hive's default schema):
        $ presto-cli --catalog hive --schema default
        presto:default> SELECT field_name1 FROM test_table LIMIT 5;

amazon-web-services

amazon-s3

amazon-redshift

amazon-athena