Adding or removing nodes from an existing GCE hadoop/spark cluster with bdutil

I'm getting started setting up a Spark cluster running on Google Compute Engine, backed by Google Cloud Storage, deployed with bdutil (on the GoogleCloudPlatform github). I'm doing this as follows:

./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket deploy
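
As a quick sanity check (not part of the original question, and assuming bdutil's default "hadoop" instance prefix), the resulting VMs can be listed with gcloud:

# List the VMs bdutil created; the master is hadoop-m, workers are hadoop-w-0, hadoop-w-1, ...
gcloud compute instances list | grep '^hadoop-'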

I anticipate starting with a 2-node cluster (the default) and then wanting to add another worker node to handle a large job that needs to run. If possible, I'd like to do this without completely tearing down and redeploying the cluster.

I've tried redeploying with the same command but a different number of nodes, and also running "create" and "run_command_group install_connectors", as shown below, but for each of these I get errors about nodes that already exist, e.g.

./bdutil -n 3 -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket deploy

./bdutil -n 3 -b myhdfsbucket create
./bdutil -n 3 -t workers -b myhdfsbucket run_command_group install_connectors

I've also tried snapshotting and cloning one of the running workers, but not all of the services seem to start up correctly, and I'm a bit out of my depth there.

Any guidance on how I could/should add and/or remove nodes from an existing cluster?

Update: We added resize_env.sh to the base bdutil repo, so you no longer need to go to my fork for it.

Original answer:

There isn't official support for resizing a bdutil-deployed cluster yet, but it's certainly something we've discussed before, and in fact it's fairly feasible to provide some basic resize support. This may take a different form once it's merged into the main branch, but I've pushed a first draft of resize support to my fork of bdutil. This was implemented across two commits: one to allow skipping all "master" operations (including create, run_command, delete, etc.) and another to add the resize_env.sh file.

I haven't tested it against all combinations of the other bdutil extensions, but I've at least successfully run it with the base bdutil_env.sh and with extensions/spark/spark_env.sh. In theory it should also work with your bigquery and datastore extensions. To use it in your case:

# Assuming you initially deployed with this command (default n == 2)
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 deploy

# Before this step, edit resize_env.sh and set NEW_NUM_WORKERS to what you want.
# Currently it defaults to 5.
# Deploy only the new workers, e.g. {hadoop-w-2, hadoop-w-3, hadoop-w-4}:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh deploy

# Explicitly start the Hadoop daemons on just the new workers:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh run_command -t workers -- "service hadoop-hdfs-datanode start && service hadoop-mapreduce-tasktracker start"

# If using Spark as well, explicitly start the Spark daemons on the new workers:
./bdutil -e bigquery_env.sh,datastore_env.sh,extensions/spark/spark_env.sh -b myhdfsbucket -n 2 -e resize_env.sh run_command -t workers -u extensions/spark/start_single_spark_worker.sh -- "./start_single_spark_worker.sh"

# From now on, it's as if you originally turned up your cluster with "-n 5".
# When deleting, remember to include those extra workers:
./bdutil -b myhdfsbucket -n 5 delete
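
To verify the resize took effect, one option (my own hedged addition, not from the original answer, and assuming the default "hadoop" prefix plus the Hadoop 1.x layout implied by the tasktracker service above) is to check that the new DataNodes registered with the NameNode:

# SSH into the master (hadoop-m is bdutil's default master name):
gcloud compute ssh hadoop-m
# Then, on the master, the dfsadmin report should show 5 live DataNodes:
/home/hadoop/hadoop-install/bin/hadoop dfsadmin -report | grep -i 'datanodes available'
# The Spark standalone master web UI (port 8080 on hadoop-m by default) should likewise list the new workers.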

In general, the best-practice recommendation is to condense your configuration into a single file rather than always passing flags around. For example, in your case you might want a file named my_base_env.sh containing:

import_env bigquery_env.sh
import_env datastore_env.sh
import_env extensions/spark/spark_env.sh

NUM_WORKERS=2
CONFIGBUCKET=myhdfsbucket

Then the resize commands are much shorter:

# Assuming you initially deployed with this command (default n == 2)
./bdutil -e my_base_env.sh deploy

# Before this step, edit resize_env.sh and set NEW_NUM_WORKERS to what you want.
# Currently it defaults to 5.
# Deploy only the new workers, e.g. {hadoop-w-2, hadoop-w-3, hadoop-w-4}:
./bdutil -e my_base_env.sh -e resize_env.sh deploy

# Explicitly start the Hadoop daemons on just the new workers:
./bdutil -e my_base_env.sh -e resize_env.sh run_command -t workers -- "service hadoop-hdfs-datanode start && service hadoop-mapreduce-tasktracker start"

# If using Spark as well, explicitly start the Spark daemons on the new workers:
./bdutil -e my_base_env.sh -e resize_env.sh run_command -t workers -u extensions/spark/start_single_spark_worker.sh -- "./start_single_spark_worker.sh"

# From now on, it's as if you originally turned up your cluster with "-n 5".
# When deleting, remember to include those extra workers:
./bdutil -b myhdfsbucket -n 5 delete

Finally, this isn't quite identical to having originally deployed your cluster with -n 5; in this case, the /home/hadoop/hadoop-install/conf/slaves and /home/hadoop/spark-install/conf/slaves files on your master node will be missing your new nodes. If you plan to use /home/hadoop/hadoop-install/bin/[stop|start]-all.sh or /home/hadoop/spark-install/sbin/[stop|start]-all.sh, you can manually SSH into your master node and edit those files to add your new nodes to the list; if not, there's no need to change those slaves files.
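
If you do decide to update those slaves files, something along these lines should work. This is a hedged sketch of my own (not part of the original answer); it assumes the default "hadoop" prefix, the 2-to-5 worker resize from above, and that the files need root to edit:

# SSH into the master:
gcloud compute ssh hadoop-m
# On the master, append the new workers to both slaves files:
for w in hadoop-w-2 hadoop-w-3 hadoop-w-4; do
  echo "$w" | sudo tee -a /home/hadoop/hadoop-install/conf/slaves \
                          /home/hadoop/spark-install/conf/slaves > /dev/null
done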