--files option in pyspark not working

Question

I tried both the sc.addFile option (which works without any issues) and the --files option from the command line (which fails).

Run 1: spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
sc.addFile("/local-path/readme.txt")
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
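# collect() brings the results to the driver; an RDD itself cannot be passed to sorted()
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)).collect())

External package: external_package.py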

class external(object):
    def __init__(self):
        pass
    def fun(self,input):
        return input*2

readme.txt

MY TEXT HERE

spark-submit command

spark-submit \
  --master yarn-client \
  --py-files /path to local codelib/external_package.py  \
  /local-pgm-path/spark_distro.py  \
  1000

Output: works as expected

['MY TEXT HERE']

But if I try to pass the file (readme.txt) from the command line with the --files option (instead of sc.addFile), it fails, as shown below.

Run 2: spark_distro.py

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

def import_my_special_package(x):
    from external_package import external
    ext = external()
    return ext.fun(x)

conf = SparkConf().setAppName("Using External Library")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
int_rdd = sc.parallelize([1, 2, 4, 3])
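# collect() brings the results to the driver; an RDD itself cannot be passed to sorted()
mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1).map(lambda x: import_my_special_package(x)).collect())

external_package.py: same as above

spark-submit command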

spark-submit \
  --master yarn-client \
  --py-files /path to local codelib/external_package.py  \
  --files /local-path/readme.txt#readme.txt  \
  /local-pgm-path/spark_distro.py  \
  1000

Output:

Traceback (most recent call last):
  File "/local-pgm-path/spark_distro.py", line 31, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'

Do sc.addFile and --files serve the same purpose? Can someone please share their thoughts?

Answer 1

I have finally figured out the issue, and it is a very subtle one indeed.

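As suspected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation; compare "on every node" with "of each executor" in the two excerpts: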

addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.

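In plain English: files added with sc.addFile are available to both the executors and the driver, whereas files added with --files are available only to the executors. Hence, when we try to access them from the driver (as in the OP), we get a No such file or directory error.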
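To make the distinction easier to debug, here is a minimal sketch of a lookup helper that reports where (if anywhere) a shipped file is visible from the current process. The name find_shipped_file and the choice of candidate directories are my own illustration and not part of the Spark API; on the driver one might pass SparkFiles.getRootDirectory() and the current working directory.

```python
import os


def find_shipped_file(name, search_dirs):
    """Return the first existing path for `name` under `search_dirs`, else None.

    Hypothetical helper for illustration only; it is not part of the Spark API.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


# Possible use inside a PySpark driver script (assumes an active SparkContext):
#   from pyspark import SparkFiles
#   path = find_shipped_file("readme.txt",
#                            [SparkFiles.getRootDirectory(), os.getcwd()])
#   if path is None:
#       raise IOError("readme.txt is not visible from this process")
```

A None result on the driver after submitting with --files would be consistent with the explanation above, while the same call after sc.addFile should return a path.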

Let's confirm this (getting rid of all the irrelevant --py-files and 1000 stuff in the OP):

test_fail.py:

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:  
    lines = [line.strip() for line in test_file]
print(lines)

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_fail.py

Result:

[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
  File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'

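In the script test_fail.py above, it is the driver program that requests access to the file readme.txt; let's change the script so that access is requested by the executors instead (test_success.py):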

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

lines = sc.textFile("readme.txt") # run in the executors
print(lines.collect())

Test:

spark-submit --master yarn \
             --deploy-mode client \
             --files /home/ctsats/readme.txt \
             /home/ctsats/scripts/SO/test_success.py

Result:

[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']

Notice also that here we don't need SparkFiles.get; the file is readily accessible.

As said above, sc.addFile works in both cases, i.e. when access is requested either by the driver or by the executors (tested but not shown here).
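For reference, a script exercising both cases could look roughly like the sketch below. It is an illustration rather than the exact script I ran: the read_lines helper and the main wrapper are names I made up here, and the local path is the one from the question, so adjust it to your environment.

```python
def read_lines(path):
    """Read a text file and return its lines with surrounding whitespace stripped."""
    with open(path) as handle:
        return [line.strip() for line in handle]


def main():
    # Imported lazily so that read_lines can be used without Spark installed.
    from pyspark import SparkConf, SparkContext, SparkFiles

    conf = SparkConf().setAppName("Use External File")
    sc = SparkContext(conf=conf)
    # Assumed local path, taken from the question.
    sc.addFile("/local-path/readme.txt")

    # Driver side: resolve the file through SparkFiles.get.
    print(read_lines(SparkFiles.get("readme.txt")))

    # Executor side: SparkFiles.get is evaluated inside each task.
    per_task = sc.parallelize([1, 2], 2).map(
        lambda _: read_lines(SparkFiles.get("readme.txt"))
    )
    print(per_task.collect())
    sc.stop()


# Call main() at the bottom of the script when running it through spark-submit.
```

With the readme.txt from the question, both print statements would be expected to show the file's single line, once for the driver and once per task.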

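Regarding the order of the command-line options: as I have argued, all Spark-related arguments must come before the script to be executed; the relative order of --files and --py-files is arguably irrelevant (left as an exercise).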
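The ordering rule can be sketched as a small helper that assembles the argument list. build_spark_submit is a hypothetical function written for this answer, not something Spark provides, and the codelib path is a placeholder; the only point is that Spark options precede the script, while everything after the script goes to the application.

```python
def build_spark_submit(script, app_args=(), files=(), py_files=(),
                       master="yarn", deploy_mode="client"):
    """Assemble a spark-submit argument list.

    Hypothetical helper: Spark options are placed before the script, and
    anything after the script is passed to the application itself.
    """
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    if files:
        cmd += ["--files", ",".join(files)]
    cmd.append(script)
    cmd.extend(str(arg) for arg in app_args)
    return cmd


cmd = build_spark_submit(
    "/local-pgm-path/spark_distro.py",
    app_args=[1000],
    files=["/local-path/readme.txt"],
    py_files=["/local-codelib/external_package.py"],  # placeholder path
)
print(" ".join(cmd))
```

The resulting list can be handed to subprocess.call if you prefer launching from Python rather than from a shell.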

Tested with both Spark 1.6.0 and 2.2.0.

UPDATE (after the comments): It seems that my fs.defaultFS setting points to HDFS, too:

$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020

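But let me focus on the forest here (instead of the trees) and explain why this whole discussion is of academic interest only:

Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no usage references online. Probably nobody uses it in practice, and with good reason.

(Notice that I am not talking about --py-files, which serves a different, legitimate role.)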

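Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all files to be processed in HDFS, period. The "natural" place for files to be processed by Spark is HDFS, not the local FS, although there are some toy examples that use the local FS for demonstration purposes only. What's more, if at some point in the future you want to change the deploy mode to cluster, you'll discover that the cluster, by default, knows nothing of local paths and files, and rightfully so...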
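As a rough sketch of that recommendation, the script below reads the file through a fully qualified HDFS URI instead of relying on --files. The hdfs_uri helper is hypothetical, and the HDFS location /user/ctsats/readme.txt is an assumption: the file would first have to be uploaded there, for example with hdfs dfs -put.

```python
def hdfs_uri(default_fs, absolute_path):
    """Join an fs.defaultFS value and an absolute path into a full URI.

    Hypothetical helper for illustration; it does no validation beyond
    checking that the path is absolute.
    """
    if not absolute_path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % absolute_path)
    return default_fs.rstrip("/") + absolute_path


def main():
    # Imported lazily so that hdfs_uri can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("Use File From HDFS")
    sc = SparkContext(conf=conf)
    # Assumed location: the file must already exist in HDFS at this path.
    uri = hdfs_uri("hdfs://host-hd-01.corp.nodalpoint.com:8020",
                   "/user/ctsats/readme.txt")
    print(sc.textFile(uri).collect())
    sc.stop()


# Call main() at the bottom of the script when running it through spark-submit.
```

Because the URI is explicit, the same script should behave the same way in client and cluster deploy modes, which is the main reason for preferring this approach.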
