Import not working in Amazon EMR even after pip install in bootstrap

I am trying to run Amazon EMR (version: emr-6.1.0) and want some Python packages preinstalled.

So I used a bootstrap script:

#!/bin/bash
sudo pip3 install --user pyspark pandas xlrd==1.2.0

EMR starts up fine, but when I try to import any of the modules I installed, I get an import error.

Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

>>> import xlrd

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-2743bb67f6dd> in <module>
----> 1 import xlrd

ModuleNotFoundError: No module named 'xlrd'

My first thought was that the packages had not been installed, but the EMR log file stdout.gz (in the path: Amazon S3 /aws-logs-600286585385-us-east-1/elasticmapreduce/j-27GOG786YFR2SB/node/i-02fabe3g74jf9959a/bootstrap-actions/1/) says otherwise:

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
Collecting pandas
  Downloading https://files.pythonhosted.org/packages/99/f7/01cea7f6c963100f045876eb4aa1817069c5c9eca73d2dbfb5d31ff9a39f/pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8MB)
Collecting xlrd==1.2.0
  Downloading https://files.pythonhosted.org/packages/b0/16/63576a1a001752e34bf8ea62e367997530dc553b689356b9879339cf45a4/xlrd-1.2.0-py2.py3-none-any.whl (103kB)
Collecting py4j==0.10.9 (from pyspark)
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Collecting python-dateutil>=2.7.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/d4/70/d60450c3dd48ef87586924207ae8907090de0b306af2bce5d134d78615cb/python_dateutil-2.8.1-py2.py3-none-any.whl (227kB)
Collecting numpy>=1.17.3 (from pandas)
  Downloading https://files.pythonhosted.org/packages/2c/d2/8973eb282fc3c7e6c4db0469f0390d81d8eb9ae56dfaa2a7e6db07283682/numpy-1.21.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (14.1MB)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/site-packages (from pandas)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas)
Installing collected packages: py4j, pyspark, python-dateutil, numpy, pandas, xlrd
  Running setup.py install for pyspark: started
    Running setup.py install for pyspark: finished with status 'done'
Successfully installed numpy-1.21.0 pandas-1.3.0 py4j-0.10.9 pyspark-3.1.2 python-dateutil-2.8.1 xlrd-1.2.0

Any ideas about what is going on or how to fix it?

If pip3 does not work, try this:

sudo python3 -m pip install pandas xlrd==1.2.0

I ran into a similar issue with emr-5.26.0, and this worked for me. I am not sure what the difference is between pip3 install and python3 -m pip install, though.
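A quick way to check whether the two commands target the same Python: pip3 is whichever script comes first on PATH, while python3 -m pip installs for the interpreter that runs it. The sketch below, run from the notebook or shell where the import fails, prints which interpreter is in use and whether it can see a given module. The helper name describe_module is made up for illustration and is not part of any library.

```python
import importlib.util
import sys


def describe_module(name):
    """Return (interpreter path, module location or None) for the running Python."""
    # find_spec returns None when a top-level module is not importable here.
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else None
    return sys.executable, location


if __name__ == "__main__":
    for module_name in ("xlrd", "pandas"):
        interpreter, location = describe_module(module_name)
        print(f"{module_name}: interpreter={interpreter} location={location}")
```

If the interpreter printed here is not the one the bootstrap pip3 belongs to, the packages were probably installed for a different Python than the one doing the import.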

I solved my problem and am posting here what went wrong.

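After an EMR instance is created, it first enters the "Starting" state. You can already attach a notebook to the EMR instance in this state, but bootstrapping has not finished yet. After a while, the instance automatically moves to the "Bootstrapping" state, in which the bootstrap commands are executed.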

My mistake was trying to import the packages before the "Bootstrapping" state had finished, which caused the import errors.

For more information on the EMR instance lifecycle, see the Amazon EMR documentation.
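To avoid importing too early, one option is to poll the cluster state and only continue once it has left "Starting" and "Bootstrapping". This is a minimal sketch, not an official recipe: wait_until_bootstrapped and emr_state_getter are made-up helper names, the poll interval and timeout are arbitrary, and the boto3 part assumes boto3 is installed and AWS credentials are configured.

```python
import time

# Cluster states in which bootstrap actions have not finished yet.
PENDING_STATES = {"STARTING", "BOOTSTRAPPING"}
# States in which the cluster is up and bootstrap actions are done.
READY_STATES = {"RUNNING", "WAITING"}


def wait_until_bootstrapped(get_state, poll_seconds=30, timeout_seconds=1800,
                            sleep=time.sleep):
    """Poll get_state() until the cluster leaves STARTING/BOOTSTRAPPING.

    get_state is any zero-argument callable returning the state string.
    Returns the ready state; raises if the cluster ends up anywhere else
    or is still pending after timeout_seconds.
    """
    waited = 0
    while True:
        state = get_state()
        if state in READY_STATES:
            return state
        if state not in PENDING_STATES:
            raise RuntimeError(f"cluster ended in state {state}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


def emr_state_getter(cluster_id, region_name="us-east-1"):
    """Build a get_state callable backed by boto3 (imported lazily)."""
    import boto3  # assumption: boto3 is available and credentials are set up

    client = boto3.client("emr", region_name=region_name)
    return lambda: client.describe_cluster(
        ClusterId=cluster_id)["Cluster"]["Status"]["State"]
```

Usage would look like wait_until_bootstrapped(emr_state_getter("j-XXXXXXXXXXXXX")), with the placeholder replaced by a real cluster ID. boto3 also ships a built-in cluster_running waiter for the EMR client that covers the same case with less code.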