How do I get Python libraries in pyspark?

Question

I want to use the matplotlib.bblpath or shapely.geometry libraries in pyspark.

When I try to import either of them, I get the following error:

>>> from shapely.geometry import polygon
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
ImportError: No module named shapely.geometry

I know the module isn't present, but how can these packages be brought into my pyspark libraries?

Answer 1

Try using the following on your Spark context:

SparkContext.addPyFile("module.py")  # also .zip

Quoting from the docs:

Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
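
For a pure-Python dependency, one way to use addPyFile is to zip the package directory and register the archive. The sketch below assumes a hypothetical local package directory and an existing SparkContext named sc. The helper names build_dependency_zip and ship_to_executors are illustrative and not part of Spark. This approach is unlikely to be enough for shapely itself, which depends on the compiled GEOS library.

```python
import os
import zipfile


def build_dependency_zip(package_dir, zip_path):
    """Zip a pure-Python package directory so it can be passed to addPyFile.

    Paths inside the archive start with the package name, so that
    `import <package>` resolves once the zip is on sys.path.
    """
    package_dir = os.path.abspath(package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in sorted(files):
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store the file under "<package>/..." inside the zip.
                    archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def ship_to_executors(sc, zip_path):
    """Register the archive with an existing SparkContext (`sc`)."""
    sc.addPyFile(zip_path)
```

With a running context, the two helpers would be called in order, for example ship_to_executors(sc, build_dependency_zip("mypkg", "deps.zip")), where mypkg is a placeholder package name.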

Answer 2

Is this on a standalone machine (i.e. laptop/desktop) or in a cluster environment (e.g. AWS EMR)?

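If it is on your laptop/desktop, pip install shapely should work just fine. You may need to check the environment variables for your default python environment(s). For example, if you typically use Python 3 but use Python 2 for pyspark, then a shapely installed for Python 3 will not be available to pyspark.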
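
A quick way to check for that kind of mismatch is to print which interpreter is running and whether it can see the module. This is a minimal sketch assuming Python 3. The function name describe_python_env is made up for illustration. PYSPARK_PYTHON is the environment variable Spark reads to choose the interpreter for Python workers.

```python
import importlib.util
import os
import sys


def describe_python_env(module_name="shapely"):
    """Report the running interpreter and whether it can find a top-level module."""
    return {
        # Interpreter executing this code (run it inside the pyspark shell to compare).
        "current_python": sys.executable,
        # None means the variable is not set in this process.
        "pyspark_python": os.environ.get("PYSPARK_PYTHON"),
        # find_spec returns None when the top-level module is not importable.
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    print(describe_python_env())
```

Running it once in a plain python session and once inside the pyspark shell shows whether the two are using the same interpreter.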

If it is in a cluster environment such as AWS EMR, you can try:

import os

def myfun(x):
    # Runs on whichever worker executes this task
    os.system("pip install shapely")
    return x

rdd = sc.parallelize([1, 2, 3, 4])  # assuming 4 worker nodes
# Trigger the function on the workers so each one installs the library
rdd.map(lambda x: myfun(x)).collect()
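
A variant of the same idea is sketched below. It calls pip through the interpreter that runs the worker and returns the hostname and exit code of each task, so you can see which nodes were reached. It assumes an existing SparkContext named sc, pip available on the workers, and permission to install there. All function names are hypothetical.

```python
import socket
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def install_on_partition(package):
    """Return a function for mapPartitions that installs `package` once per partition."""
    def _install(_rows):
        exit_code = subprocess.call(pip_install_command(package))
        yield (socket.gethostname(), exit_code)
    return _install


def install_everywhere(sc, package, num_partitions):
    """Run the install in `num_partitions` tasks and collect (host, exit_code) pairs.

    Spark does not guarantee one task per worker, so inspect the returned
    hostnames to see which nodes were actually reached.
    """
    rdd = sc.parallelize(range(num_partitions), num_partitions)
    return rdd.mapPartitions(install_on_partition(package)).collect()
```

Because task placement is up to the scheduler, a bootstrap step is generally the more dependable route when every node must have the library.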

"I know the module isn't present, but I want to know how can these packages be brought to my pyspark libraries."

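On EMR, if you want pyspark to come pre-prepared with whatever other libraries and configurations you want, you can use a bootstrap step to make those adjustments. Other than that, you can't "add" a library to pyspark without compiling Spark in Scala (which would be a pain to do if you're not savvy with SBT).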

Answer 3

This is how I got it working on our AWS EMR cluster (it should be the same on any other cluster). I created the following shell script and executed it as a bootstrap action:

#!/bin/bash
# shapely installation
wget http://download.osgeo.org/geos/geos-3.5.0.tar.bz2
tar jxf geos-3.5.0.tar.bz2
cd geos-3.5.0 && ./configure --prefix=$HOME/geos-bin && make && make install
sudo cp /home/hadoop/geos-bin/lib/* /usr/lib
sudo /bin/sh -c 'echo "/usr/lib" >> /etc/ld.so.conf'
sudo /bin/sh -c 'echo "/usr/lib/local" >> /etc/ld.so.conf'
sudo /sbin/ldconfig
sudo /bin/sh -c 'echo -e "\nexport LD_LIBRARY_PATH=/usr/lib" >> /home/hadoop/.bashrc'
source /home/hadoop/.bashrc
sudo pip install shapely
echo "Shapely installation complete"
pip install https://pypi.python.org/packages/74/84/fa80c5e92854c7456b591f6e797c5be18315994afd3ef16a58694e1b5eb1/Geohash-1.0.tar.gz
#
exit 0

Note: Instead of running it as a bootstrap action, this script can be executed independently on every node in the cluster. I have tested both scenarios.

The following sample pyspark and shapely code (a Spark SQL UDF) confirms that the commands above work as expected:

Python 2.7.10 (default, Dec  8 2015, 18:25:23) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.10 (default, Dec  8 2015 18:25:23)
SparkContext available as sc, HiveContext available as sqlContext.
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StringType
>>> from shapely.wkt import loads as load_wkt
>>> def parse_region(region):
...     from shapely.wkt import loads as load_wkt
...     reverse_coordinate = lambda coord: ' '.join(reversed(coord.split(':')))
...     coordinate_list = map(reverse_coordinate, region.split(', '))
...     if coordinate_list[0] != coordinate_list[-1]:
...         coordinate_list.append(coordinate_list[0])
...     return str(load_wkt('POLYGON ((%s))' % ','.join(coordinate_list)).wkt)
... 
>>> udf_parse_region=udf(parse_region, StringType())
16/09/06 22:18:34 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/06 22:18:34 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>>> df = sqlContext.sql('select id, bounds from <schema.table_name> limit 10')
>>> df2 = df.withColumn('bounds1', udf_parse_region('bounds'))
>>> df2.first()
Row(id=u'0089d43a-1b42-4fba-80d6-dda2552ee08e', bounds=u'33.42838509594465:-119.0533447265625, 33.39170168789402:-119.0203857421875, 33.29992542601392:-119.0478515625', bounds1=u'POLYGON ((-119.0533447265625 33.42838509594465, -119.0203857421875 33.39170168789402, -119.0478515625 33.29992542601392, -119.0533447265625 33.42838509594465))')
>>>

Thanks, Hussain Bohra
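
The session above runs on Python 2.7, where map returns a list. On Python 3, map returns an iterator, so indexing coordinate_list would fail. The sketch below ports only the coordinate-reordering step to Python 3 and leaves out the shapely round-trip so that it runs without the library. The function name is hypothetical, and the reading of each pair as "latitude:longitude" is an assumption based on the sample row.

```python
def reorder_bounds_to_wkt(region):
    """Turn 'a:b, a:b, ...' into a closed 'POLYGON ((b a, ...))' string."""
    # Swap each 'a:b' pair into 'b a', as the UDF in the session does
    # (assumed here to be latitude:longitude in, longitude latitude out).
    coordinates = [
        " ".join(reversed(pair.split(":"))) for pair in region.split(", ")
    ]
    # Close the ring by repeating the first point when it is not already last.
    if coordinates[0] != coordinates[-1]:
        coordinates.append(coordinates[0])
    return "POLYGON ((%s))" % ", ".join(coordinates)
```

To get the same validation as the original UDF, the returned string could still be passed through shapely.wkt.loads on a node where shapely is installed.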

Answer 4

I found a good solution in the AWS docs that uses SparkContext. I was able to add Pandas and other packages with it:

Using SparkContext to add packages to notebook with PySpark Kernel in EMR

sc.install_pypi_package("pandas==0.25.1")
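
Since install_pypi_package is provided by the EMR notebook environment rather than by stock PySpark, a small guard can make the failure clearer when the same code runs elsewhere. This is a sketch only, and the wrapper name install_notebook_package is hypothetical.

```python
def install_notebook_package(sc, requirement):
    """Install a PyPI requirement through the EMR notebook API, if present."""
    if not hasattr(sc, "install_pypi_package"):
        raise RuntimeError(
            "This SparkContext has no install_pypi_package method; "
            "it is provided by EMR notebooks, not by stock PySpark."
        )
    sc.install_pypi_package(requirement)
```

In an EMR notebook this would be called as install_notebook_package(sc, "pandas==0.25.1").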


python

python-2.7

shapely

pyspark