How to set PYTHONHASHSEED on AWS EMR
Is there any way to set an environment variable on all nodes of an EMR cluster?
I am getting an error when trying to use reduceByKey() in PySpark on Python 3, complaining about the hash seed. I can see this is a known issue and that the environment variable PYTHONHASHSEED needs to be set to the same value on every node of the cluster, but I haven't had any luck with it.
I tried adding a variable to spark-env through the cluster configuration:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3",
          "PYTHONHASHSEED": "123"
        }
      }
    ]
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
But this does not work. I also tried adding a bootstrap script:
#!/bin/bash
export PYTHONHASHSEED=123
But that doesn't seem to work either.
You can probably do it via a bootstrap script, but you'll need to do something like this:
echo "PYTHONHASHSEED=XXXX" >> /home/hadoop/.bashrc
(or possibly .profile)
so that it is picked up by the spark processes at launch.
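A minimal sketch of what such a bootstrap action could look like (the script name, the S3 path and the added export keyword are my assumptions, not part of the answer above):
#!/bin/bash
# Hypothetical bootstrap script, e.g. s3://my-bucket/set-hashseed.sh; it runs on
# every node while the cluster is being provisioned. export is added so the
# variable is inherited by processes started from the hadoop user's login shell.
echo "export PYTHONHASHSEED=123" >> /home/hadoop/.bashrc
echo "export PYTHONHASHSEED=123" >> /home/hadoop/.profile
The script would then be attached at cluster launch, for example with --bootstrap-actions Path=s3://my-bucket/set-hashseed.sh on aws emr create-cluster (bucket and file name are placeholders).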
Although your configuration looks reasonable, it may be worth setting it in the hadoop-env section instead?
From the Spark docs:
Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.
The properties are listed here, so I think you want this:
Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN.
spark.yarn.appMasterEnv.PYTHONHASHSEED="XXXX"
The EMR docs for configuring spark-defaults.conf are here.
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.yarn.appMasterEnv.PYTHONHASHSEED": "XXX"
    }
  }
]
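For completeness, a hedged example of how such a classification file might be supplied when launching the cluster with the AWS CLI (the file name configurations.json and the instance settings are placeholders of mine):
# Pass the JSON above as a local file; --configurations is a standard
# option of aws emr create-cluster.
aws emr create-cluster \
    --name "pythonhashseed-test" \
    --release-label emr-4.8.2 \
    --applications Name=Spark \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --configurations file://configurations.json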
I believe that /usr/bin/python3 is not picking up the environment variable PYTHONHASHSEED that you defined in the cluster configuration under the spark-env scope.
You should use python34 instead of /usr/bin/python3 and set the configuration as follows:
[
  {
    "classification": "spark-defaults",
    "properties": {
      // [...]
    }
  },
  {
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34",
          "PYTHONHASHSEED": "123"
        }
      }
    ],
    "classification": "spark-env",
    "properties": {
      // [...]
    }
  }
]
Now, let's test it. I defined a bash script that calls both pythons:
#!/bin/bash
echo "using python34"
for i in `seq 1 10`;
do
python -c "print(hash('foo'))";
done
echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`;
do
/usr/bin/python3 -c "print(hash('foo'))";
done
The verdict:
[hadoop@ip-10-0-2-182 ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314
PS1: I am using AMI version emr-4.8.2.
PS2: The snippet above is inspired by .
EDIT: I have tested the following using pyspark:
16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.6.2
/_/
Using Python version 3.4.3 (default, Sep 1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
Also created a simple application (simple_app.py):
from pyspark import SparkContext
sc = SparkContext(appName = "simple-app")
numbers = [hash('foo') for i in range(10)]
print(numbers)
Which also seems to work perfectly:
[hadoop@ip-*** ~]$ spark-submit --master yarn simple_app.py
Output (truncated):
[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]
As you can see, it also works, returning the same hash each time.
EDIT 2: From the comments, it seems that you are trying to compute hashes on the executors and not on the driver, so you'll need to set spark.executorEnv.PYTHONHASHSEED inside your spark application configuration so it can be propagated to the executors (it's one way of doing it).
Note : Setting the environment variables for executors is the same with YARN client, use the spark.executorEnv.[EnvironmentVariableName].
Thus the following minimalist example with simple_app.py:
from pyspark import SparkContext, SparkConf
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED","123")
sc = SparkContext(appName="simple-app", conf=conf)
numbers = sc.parallelize(['foo']*10).map(lambda x: hash(x)).collect()
print(numbers)
And now let's test it again. Here is the truncated output:
16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
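The same executor setting can also be passed on the command line instead of in code; a hedged equivalent (not from the original answer):
# --conf is a standard spark-submit option; this sets the executor-side
# environment variable without touching simple_app.py.
spark-submit --master yarn --conf spark.executorEnv.PYTHONHASHSEED=123 simple_app.py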
I think that covers everything.
Just ran into the same problem; adding the following configuration solved it:
# Some settings...
Configurations=[
    {
        "Classification": "spark-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYSPARK_PYTHON": "python34"
                },
                "Configurations": []
            }
        ]
    },
    {
        "Classification": "hadoop-env",
        "Properties": {},
        "Configurations": [
            {
                "Classification": "export",
                "Properties": {
                    "PYTHONHASHSEED": "0"
                },
                "Configurations": []
            }
        ]
    }
],
# Some more settings...
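For context, a hedged sketch of where such a Configurations block might sit: it looks like the keyword argument of a boto3 run_job_flow call, and everything around it below is a placeholder of mine, not taken from the answer above.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="my-cluster",                  # hypothetical name
    ReleaseLabel="emr-4.8.2",           # release label mentioned elsewhere in this thread
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={                         # minimal placeholder instance configuration
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Configurations=[
        # ... the spark-env / hadoop-env classifications shown above ...
    ],
)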
Note: we are not using yarn as a cluster manager; at the moment the cluster is only running Hadoop and Spark.
EDIT: per Tim B's comment, this also seems to work with yarn installed as the cluster manager.