How to Schedule R Scripts Using Oozie
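I am using RHadoop on the Hortonworks Sandbox to read data from HDFS into R. After reading the data in R, I perform certain operations on that file.
I want to schedule this R script (daily, weekly, monthly) using Oozie.
Any help is highly appreciated.
Thanks
It looks like someone has already done this for you:
Here are the relevant bash script and usage instructions from the Oozie R helper on GitHub.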
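#!/bin/bash
# Print an error message to stderr and abort.
die () {
    echo >&2 "$@"
    exit 1
}
[ "$#" -eq 3 ] || die "3 arguments required, $# provided"
hdfs_file=$1      # HDFS input (file or directory) to process
r_file=$2         # R script to run
hdfs_output=$3    # HDFS output path, must be under /tmp/
if [[ ${hdfs_output} =~ ^\/tmp\/.*$ ]]; then
    echo "I will run the R script on the HDFS data"
    # Local scratch files, named by timestamp
    tmp_filename="/tmp/$(date +"%Y%m%d.%H%M%S")"
    echo "using tmp file $tmp_filename"
    tmp_output="/tmp/out$(date +"%Y%m%d.%H%M%S")"
    # Merge the HDFS input into one local file and run R on it
    hadoop fs -getmerge "$hdfs_file" "$tmp_filename"
    R -f "$r_file" --args "$tmp_filename" "$tmp_output"
    # Replace any previous result (-rmr is deprecated in newer Hadoop; use -rm -r there)
    hadoop fs -rmr "$hdfs_output"
    hadoop fs -put "$tmp_output" "$hdfs_output"
else
    die "$hdfs_output must be in /tmp/"
fi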
Oozie R helper
The data science team wanted to be able to run R scripts using Oozie. They wanted to run ETL using Hive and then run an R script on the result of that ETL.
So I created a bash script that takes 3 arguments:
1. The HDFS input of the files they want to process
2. The R script they want to run
3. The output location on HDFS where they want their result to be (currently, because the user is mapred, I allow only /tmp/)
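The restriction of the output path to /tmp/ can be tried in isolation, without Hadoop or R. The following is only a sketch: `is_allowed_output` is a hypothetical helper name that does not appear in the original script, and the `case` pattern is used here as a POSIX stand-in for the script's `^\/tmp\/.*$` regex.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): succeeds only
# when the given path starts with /tmp/, like the script's regex guard.
is_allowed_output() {
    case "$1" in
        /tmp/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Try the guard on one accepted and one rejected example path.
for path in /tmp/r_test /user/hive/r_test; do
    if is_allowed_output "$path"; then
        echo "$path: accepted"
    else
        echo "$path: rejected"
    fi
done
```

With the real script, a rejected path leads to the `die` branch, so the Oozie action fails instead of writing outside /tmp/.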
How to run
You can use an Oozie shell action like this:
<shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>run_r_hadoop.sh</exec>
    <argument>/user/hive/warehouse/dual</argument>
    <argument>count.r</argument>
    <argument>/tmp/r_test</argument>
    <file>count.r#count.r</file>
</shell>
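The shell action above runs once per workflow submission. For the daily, weekly, or monthly scheduling asked about in the question, the usual approach is to wrap the workflow in an Oozie coordinator. The sketch below generates a minimal daily coordinator and a matching properties file. Everything specific in it is an assumption for illustration: the coordinator name, the host names and ports, the HDFS paths, and the start and end dates are hypothetical and need to be adapted to your cluster.

```shell
#!/bin/sh
# Sketch: generate a daily Oozie coordinator for the workflow that
# contains the shell action. All names, hosts, paths and dates below
# are hypothetical placeholders.

out_dir=$(mktemp -d)   # scratch directory for the generated files

# The quoted 'EOF' keeps ${...} literal so that Oozie, not the shell,
# expands these expressions.
cat > "$out_dir/coordinator.xml" <<'EOF'
<coordinator-app name="r-script-daily" frequency="${coord:days(1)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${workflowAppUri}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF

cat > "$out_dir/coordinator.properties" <<'EOF'
nameNode=hdfs://sandbox.hortonworks.com:8020
jobTracker=sandbox.hortonworks.com:8050
workflowAppUri=${nameNode}/user/${user.name}/r_workflow
oozie.coord.application.path=${nameNode}/user/${user.name}/r_coordinator
start=2015-01-01T00:00Z
end=2016-01-01T00:00Z
EOF

echo "wrote $out_dir/coordinator.xml and $out_dir/coordinator.properties"
```

After uploading coordinator.xml to the HDFS directory named in `oozie.coord.application.path`, the job would be submitted with `oozie job -oozie <your Oozie URL> -config coordinator.properties -run`. For other intervals, the frequency can be changed to `${coord:days(7)}` (weekly) or `${coord:months(1)}` (monthly).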
Prerequisite
R, together with every R library your scripts use, must be installed on all Hadoop slave nodes.
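Because the action can run on any worker node, it may help to verify the prerequisite on each node before scheduling the job. The following is a minimal sketch, assuming you can run a script on each node yourself; `have_cmd` is a hypothetical helper name, and the check only covers the two commands the helper script calls, not the individual R libraries.

```shell
#!/bin/sh
# Hypothetical helper: true when the named command is on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report on the tools the helper script calls. This only prints a
# status line per tool; it does not abort when one is missing.
for cmd in R hadoop; do
    if have_cmd "$cmd"; then
        echo "$cmd: found on $(uname -n)"
    else
        echo "$cmd: MISSING on $(uname -n)"
    fi
done
```

To check a specific R package as well, one option is to run `Rscript -e 'library(yourpackage)'` on the node, where `yourpackage` stands for whichever library your script loads.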