Executing HiveQL in an EMR cluster

Question

I created an EMR cluster with the AWS CLI:

aws emr create-cluster \
  --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --tags Name="EMR-Atlas" \
  --release-label emr-5.16.0 \
  --ec2-attributes SubnetId=subnet-xxxxx,KeyName=atlas-emr-dif \
  --use-default-roles \
  --ebs-root-volume-size 100 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.xlarge \
  --log-uri s3://xxx/logs/new-log \
  --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]

Then I set up an SSH tunnel for Hue:

ssh -L 8888:localhost:8888 -i key.pem hadoop@<EMR Master IP Address>

I created a Hive table through Hue:


CREATE external TABLE us_disease
(
YearStart int,
StratificationCategory2 string,
GeoLocation string,
ResponseID string,
LocationID int,
TopicID string
)
row format delimited
fields terminated by ','
LOCATION 's3://XXXX/data/USHealthcare/'
TBLPROPERTIES ("skip.header.line.count"="1");

I can fetch records with a SELECT statement through Hue.

However, if I try to run the SELECT statement from an HQL script, it fails. This is what I tried. My HQL is a simple SELECT statement:

select * from us_disease limit 10;

I stored this in S3 as hive.hql.

I then ran the HQL script as a step in the EMR cluster.
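
For reference, a Hive script step of the kind recorded in the step log can be submitted with `aws emr add-steps`. This is a minimal sketch, not necessarily the command used here: the cluster ID `j-XXXXXXXX` and the step name `RunHiveScript` are placeholders, and the S3 path is taken from the log. By default the script only prints the command; setting `RUN=1` would actually submit it.

```shell
#!/usr/bin/env bash
# Sketch: submit hive.hql as a Hive step.
# CLUSTER_ID and the step name are placeholders (assumptions);
# the default script path is the one that appears in the step log.
CLUSTER_ID="${CLUSTER_ID:-j-XXXXXXXX}"
SCRIPT_URI="${SCRIPT_URI:-s3://dif-test/data-governance/hql/hive.hql}"

# Build the shorthand --steps value for a Hive step that runs `-f <script>`.
build_hive_step() {
  local script_uri="$1"
  printf 'Type=HIVE,Name=RunHiveScript,ActionOnFailure=CONTINUE,Args=[-f,%s]' "$script_uri"
}

STEP="$(build_hive_step "$SCRIPT_URI")"

if [ "${RUN:-0}" = "1" ]; then
  # Submit the step (requires AWS credentials and a running cluster).
  aws emr add-steps --cluster-id "$CLUSTER_ID" --steps "$STEP"
else
  # Dry run: only show the command that would be executed.
  echo "aws emr add-steps --cluster-id $CLUSTER_ID --steps '$STEP'"
fi
```

Keeping the step definition in a small function makes it easy to reuse the same shape for other scripts by passing a different S3 URI.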

Log:

INFO redirectError to /mnt/var/log/hadoop/steps/s-xxxxxxxx/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-xxxxxxxx
INFO ProcessRunner started child process 30597 :
hadoop   30597  5505  0 11:40 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar hive-script --run-hive-script --args -f s3://dif-test/data-governance/hql/hive.hql
2021-03-30T11:40:36.318Z INFO HadoopJarStepRunner.Runner: startRun() called for s-xxxxxxxx Child Pid: 30597
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 127 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 2 seconds
2021-03-30T11:40:36.437Z INFO Step created jobs: 
2021-03-30T11:40:36.438Z WARN Step failed with exitCode 127 and took 2 seconds

Stderr:

/usr/lib/hadoop/bin/hadoop: line 169: /etc/alternatives/jre/bin/java: No such file or directory
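
Exit code 127 is the shell's conventional "command not found" status, which matches the stderr line: the `hadoop` wrapper could not find a Java binary at that path. One way to confirm this on the master node is sketched below. The helper name `check_executable` is made up for illustration, and the default path is copied from the error message.

```shell
#!/usr/bin/env bash
# Sketch: check whether the Java binary named in the stderr line exists.
# Meant to be run on the master node; the default path comes from the error.
JAVA_BIN="${JAVA_BIN:-/etc/alternatives/jre/bin/java}"

# Report whether a path is an executable file.
# Returns 0 if it is, 1 otherwise.
check_executable() {
  local path="$1"
  if [ -x "$path" ] && [ ! -d "$path" ]; then
    # Resolve symlinks so the real target of the alternatives link is visible.
    echo "found: $path -> $(readlink -f "$path")"
    return 0
  fi
  echo "missing or not executable: $path"
  return 1
}

check_executable "$JAVA_BIN" || echo "hadoop cannot start a JVM from this path"
```

If the path is reported missing, the step fails before Hive is ever invoked, which is consistent with the 2-second run time in the log.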

Any help is appreciated. Thanks.

Answer 1

The problem was resolved after I upgraded the EMR release. Previously I was using emr-5.16.0; I changed it to emr-5.32.0.

Modified command:

aws emr create-cluster \
  --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --tags Name="EMR-Atlas" \
  --release-label emr-5.32.0 \
  --ec2-attributes SubnetId=subnet-xxxx,KeyName=atlas-emr-dif \
  --use-default-roles \
  --ebs-root-volume-size 100 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
  --log-uri s3://xxx/xxx/new-log \
  --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
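
To confirm which release a running cluster actually uses, the release label can be read back with `aws emr describe-cluster`. The sketch below assumes a placeholder cluster ID `j-XXXXXXXX` and a made-up helper name `release_label_cmd`; it prints the command unless `RUN=1` is set.

```shell
#!/usr/bin/env bash
# Sketch: print the release label of a cluster.
# CLUSTER_ID is a placeholder (assumption); replace it with a real cluster ID.
CLUSTER_ID="${CLUSTER_ID:-j-XXXXXXXX}"

# Echo the describe-cluster command for a given cluster ID.
release_label_cmd() {
  echo "aws emr describe-cluster --cluster-id $1 --query Cluster.ReleaseLabel --output text"
}

if [ "${RUN:-0}" = "1" ]; then
  # Query the cluster (requires AWS credentials).
  aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
    --query Cluster.ReleaseLabel --output text
else
  # Dry run: only show the command that would be executed.
  release_label_cmd "$CLUSTER_ID"
fi
```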
