我如何在 BigInsights on cloud Enterprise 集群上使用 python > 2.6.6 和 spark?

How can I use python > 2.6.6 with spark on BigInsights on cloud Enterprise clusters?

BigInsights python 的版本当前为 2.6.6。如何在 yarn 上的 spark 作业 运行 中使用不同版本的 Python?

请注意,云上 BigInsights 的用户没有根访问权限。

安装 Anaconda

此脚本在 BigInsights on cloud 4.2 Enterprise 集群上安装 anaconda python。 请注意,这些说明不适用于基本集群,因为您只能登录到一个 shell 节点,而不能登录任何其他节点。

通过 Ssh 进入 mastermanager 节点,然后 运行(根据您的环境更改值):

export BI_USER=snowch
export BI_PASS=changeme
export BI_HOST=bi-hadoop-prod-4118.bi.services.us-south.bluemix.net

下一个运行以下。该脚本试图尽可能地幂等,因此如果您 运行 多次它并不重要:

# abort if the script encounters an error or undeclared variables
set -euo

CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS  -X GET https://${BI_HOST}:9443/api/v1/clusters | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
echo Cluster Name: $CLUSTER_NAME

CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS  -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
echo Cluster Hosts: $CLUSTER_HOSTS

wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh

# Install anaconda if it isn't already installed
[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b

# You can install your pip modules using something like this:
# ${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/pip install yourlibrary

# Install anaconda on all of the cluster nodes
for CLUSTER_HOST in ${CLUSTER_HOSTS}; 
do 
   if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
   then
      echo "*** Processing $CLUSTER_HOST ***"
      ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
      ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"

      # You can install your pip modules on each node using something like this:
      # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/pip install yourlibrary"

      # Set the PYSPARK_PYTHON path on all of the nodes
      ssh $BI_USER@$CLUSTER_HOST "grep '^export PYSPARK_PYTHON=' ~/.bash_profile || echo export PYSPARK_PYTHON=${HOME}/anaconda2/bin/python2.7 >> ~/.bash_profile"
      ssh $BI_USER@$CLUSTER_HOST "sed -i -e 's;^export PYSPARK_PYTHON=.*$;export PYSPARK_PYTHON=${HOME}/anaconda2/bin/python2.7;g' ~/.bash_profile"
      ssh $BI_USER@$CLUSTER_HOST "cat ~/.bash_profile"
   fi
done

echo 'Finished installing'

运行 一个 pyspark 作业

如果您使用的是 pyspark,则可以使用 anaconda python,在 运行 执行 pyspark 命令之前设置以下变量:

export SPARK_HOME=/usr/iop/current/spark-client
export HADOOP_CONF_DIR=/usr/iop/current/hadoop-client/conf

# set these to the folders where you installed anaconda
export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7

spark-submit --master yarn --deploy-mode client ...

# NOTE: --deploy-mode cluster does not seem to use the PYSPARK_PYTHON setting
...

飞艇(可选)

如果您使用的是 Zeppelin (as per these instructions for BigInsights on cloud),请在 zeppelin_env.sh 中设置以下变量:

# set these to the folders where you installed anaconda
export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7