Executing HiveQL in an EMR cluster
I created an EMR cluster through the AWS CLI:
aws emr create-cluster \
  --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper \
  --tags Name="EMR-Atlas" \
  --release-label emr-5.16.0 \
  --ec2-attributes SubnetId=subnet-xxxxx,KeyName=atlas-emr-dif \
  --use-default-roles \
  --ebs-root-volume-size 100 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m4.xlarge \
  --log-uri s3://xxx/logs/new-log \
  --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]
Then I set up an SSH tunnel to reach Hue:
ssh -L 8888:localhost:8888 -i key.pem hadoop@<EMR Master IP Address>
I created a Hive table through Hue:
CREATE external TABLE us_disease
(
YearStart int,
StratificationCategory2 string,
GeoLocation string,
ResponseID string,
LocationID int,
TopicID string
)
row format delimited
fields terminated by ','
LOCATION 's3://XXXX/data/USHealthcare/'
TBLPROPERTIES ("skip.header.line.count"="1");
I can fetch records through Hue with a SELECT statement.
However, when I try to run the SELECT statement from an HQL script, it fails.
Here is what I tried.
My HQL is a simple SELECT statement:
select * from us_disease limit 10;
I stored the same statement in S3 as hive.hql.
I ran the HQL script in the EMR cluster as a step.
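The post does not show how the step was submitted. The step log invokes command-runner.jar with `hive-script --run-hive-script --args -f <script>`, which is consistent with a step of type HIVE. A minimal sketch of submitting such a step from the AWS CLI follows; the cluster ID and the step name are hypothetical placeholders, and the script only prints the command rather than calling AWS.

```shell
#!/usr/bin/env bash
# Sketch only: CLUSTER_ID and the step name are hypothetical placeholders,
# not values taken from the post.
CLUSTER_ID="${CLUSTER_ID:-j-XXXXXXXXXXXXX}"
SCRIPT_URI="s3://dif-test/data-governance/hql/hive.hql"   # path seen in the step log

# Build the shorthand value for --steps: a Hive step that runs a script with -f.
hive_step() {
  local name="$1" script_uri="$2"
  printf 'Type=HIVE,Name=%s,ActionOnFailure=CONTINUE,Args=[-f,%s]' "$name" "$script_uri"
}

STEP="$(hive_step "RunHiveScript" "$SCRIPT_URI")"

# Dry run: print the command instead of calling AWS.
echo aws emr add-steps --cluster-id "$CLUSTER_ID" --steps "$STEP"

# To submit for real, upload the script and drop the echo:
#   aws s3 cp hive.hql "$SCRIPT_URI"
#   aws emr add-steps --cluster-id "$CLUSTER_ID" --steps "$STEP"
```

Building the `--steps` value in a small function keeps the quoting in one place, which helps because the shorthand syntax is sensitive to stray spaces and commas.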
Log:
INFO redirectError to /mnt/var/log/hadoop/steps/s-xxxxxxxx/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-xxxxxxxx
INFO ProcessRunner started child process 30597 :
hadoop 30597 5505 0 11:40 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar hive-script --run-hive-script --args -f s3://dif-test/data-governance/hql/hive.hql
2021-03-30T11:40:36.318Z INFO HadoopJarStepRunner.Runner: startRun() called for s-xxxxxxxx Child Pid: 30597
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 127 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 2 seconds
2021-03-30T11:40:36.437Z INFO Step created jobs:
2021-03-30T11:40:36.438Z WARN Step failed with exitCode 127 and took 2 seconds
Standard error:
/usr/lib/hadoop/bin/hadoop: line 169: /etc/alternatives/jre/bin/java: No such file or directory
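Exit code 127 is the status a shell reports when the command it was asked to run cannot be found, which matches the stderr line: the hadoop launcher could not find a java binary at /etc/alternatives/jre/bin/java. The sketch below reproduces that status with a deliberately nonexistent path and adds a small check that could be run on the master node over SSH. The function name `check_java` is made up for illustration, and nothing here explains why the binary was missing on this cluster.

```shell
#!/usr/bin/env bash
# Report whether an executable java binary exists at the given path.
# Defaults to the path from the stderr message in the post.
check_java() {
  local java_bin="${1:-/etc/alternatives/jre/bin/java}"
  if [ -x "$java_bin" ]; then
    echo "ok: $java_bin"
  else
    echo "missing: $java_bin"
    return 1
  fi
}

# Running a path that does not exist makes the shell exit with status 127,
# the same code the failed step reported.
status=0
bash -c '/nonexistent/bin/java -version' 2>/dev/null || status=$?
echo "exit status for a missing binary: $status"

# On the master node, this reports whether the JRE alternative is in place.
check_java || echo "java is not where /usr/lib/hadoop/bin/hadoop looked for it"
```

A dangling symlink also counts as missing here, since `[ -x ]` follows the link before testing it.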
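Any help is appreciated. Thanks.
The problem was resolved after I updated the EMR release. Previously I was using emr-5.16.0; I changed it to emr-5.32.0.
The modified command: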
aws emr create-cluster --applications Name=Hive Name=HBase Name=Hue Name=Hadoop Name=ZooKeeper --tags Name="EMR-Atlas" --release-label emr-5.32.0 --ec2-attributes SubnetId=subnet-xxxx,KeyName=atlas-emr-dif --use-default-roles --ebs-root-volume-size 100 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge --log-uri s3://xxx/xxx/new-log --steps Name="Run Remote Script",Jar=command-runner.jar,Args=[bash,-c,"curl https://s3.amazonaws.com/aws-bigdata-blog/artifacts/aws-blog-emr-atlas/apache-atlas-emr.sh -o /tmp/script.sh; chmod +x /tmp/script.sh; /tmp/script.sh"]