Automated insertion from Hive to Elasticsearch

Question

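I am currently trying to find a way to automatically add data from Hadoop text files to Elasticsearch. We are running Hive v0.11, Hadoop v2.0.5, Elasticsearch 1.7.1 and elasticsearch-hadoop v2.1.0. The files are stored under the path /tmp/test-log/apache2log, in separate subfolders named year/month/day. This table was created to read the data from Hadoop: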

CREATE EXTERNAL TABLE apache2log(
userIP STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
thread STRING,
link STRING,
callerInformation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
LOCATION '/tmp/test-log/apache2log';

But when I try to create a table that inserts this data into Elasticsearch, the creation works fine but the table is empty. I tried the following command:

CREATE EXTERNAL TABLE apache2log(
userIP STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
thread STRING,
link STRING,
callerInformation STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED by '|'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
LOCATION '/tmp/test-log/apache2log'
TBLPROPERTIES(
'es.nodes'='1.2.3.4', 
'es.resource'='sam3/apache2',
'es.net.proxy.http.use.system.props'='false');

Settings changed from their defaults:

SET hive.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories = true;
SET hive.supports.subdirectories=true;
SET mapred.input.dir.recursive = true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

ADD JAR /usr/lib/gphd/hive-0.11.0_gphd_2_1_1_0/lib/elasticsearch-hadoop-2.1.0.jar;

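I know it is possible to create a second table for Elasticsearch and use INSERT to add the data. But I need this process to be automated, so data added to the files should be inserted into the table as soon as it arrives in Hadoop.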

Answer 1

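I don't think there is a way to do this. If there were, there would be no need to define the storage handler for the table in a separate external table.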
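For reference, the two-table approach the question mentions could look roughly like the sketch below. As far as I understand, a table declared with EsStorageHandler is backed by the Elasticsearch index rather than by the files under LOCATION, which would explain why the table in the question comes back empty. The table name apache2log_es is made up for illustration. The connection properties are copied from the question, and the sketch assumes the elasticsearch-hadoop JAR has already been added to the session with ADD JAR.

```sql
-- Hypothetical table name: apache2log_es (not from the original question).
-- This table is backed by the Elasticsearch index sam3/apache2, so it has
-- no ROW FORMAT and no LOCATION clause.
CREATE EXTERNAL TABLE apache2log_es(
userIP STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
thread STRING,
link STRING,
callerInformation STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes'='1.2.3.4',
'es.resource'='sam3/apache2',
'es.net.proxy.http.use.system.props'='false');

-- Copy the rows from the file-backed table (the first CREATE in the question)
-- into Elasticsearch. Both tables declare the same columns in the same order.
INSERT OVERWRITE TABLE apache2log_es
SELECT * FROM apache2log;
```

The INSERT still has to be triggered by something outside Hive, for example a scheduled job that runs the script with hive -f whenever new files land. Note that re-running it sends every row in apache2log to Elasticsearch again, so without a document ID mapping (elasticsearch-hadoop has an es.mapping.id setting for this) repeated runs would probably create duplicate documents.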

hadoop

hive

elasticsearch

elasticsearch-hadoop