How to solve path issue for Spark Context? AnalysisException: Path does not exist: file:/opt/workspace/
I am running JupyterLab on macOS. Part of the code:
new_list = []
for k in get_matching_s3_keys(bucket='cw-milenko-tests', prefix='Json_gzips/ticr_calculated_2', suffix='.gz'):
    new_list.append(k)
dfs = [spark.read.json(file) for file in new_list]
print([len(df.schema) for df in dfs])
I download from S3 and save the keys to a list. I get this error:
AnalysisException: Path does not exist: file:/opt/workspace/Json_gzips/ticr_calculated_2_2020-05-27T00-01-21.json.gz;
This is the Spark cluster I use: I run the Spark cluster from this repo on Docker.
How can I check whether my Docker containers communicate?
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a01477cd9316 andreper/spark-worker:latest "/bin/sh -c 'bin/spa…" 4 days ago Up 3 hours 0.0.0.0:8082->8081/tcp spark-worker-2
f448de886c72 andreper/spark-worker:latest "/bin/sh -c 'bin/spa…" 4 days ago Up 3 hours 0.0.0.0:8081->8081/tcp spark-worker-1
5789c47ef46e andreper/jupyterlab:latest "/bin/sh -c 'jupyter…" 4 days ago Up 3 hours 0.0.0.0:8888->8888/tcp jupyterlab
63e3d3c90ed6 andreper/spark-master:latest "/bin/sh -c 'bin/spa…" 4 days ago Up 3 hours 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp spark-master
I checked the mounts of jupyterlab and spark-master:
milenko@Cloudwalkers-MacBook-Pro spark-cluster-on-docker % docker inspect -f '{{ .Mounts }}' 5789c47ef46e
[{volume hadoop-distributed-file-system /var/lib/docker/volumes/hadoop-distributed-file-system/_data /opt/workspace local rw true }]
milenko@Cloudwalkers-MacBook-Pro spark-cluster-on-docker % docker inspect -f '{{ .Mounts }}' 63e3d3c90ed6
[{volume hadoop-distributed-file-system /var/lib/docker/volumes/hadoop-distributed-file-system/_data /opt/workspace local rw true }]
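Since both containers mount the same `hadoop-distributed-file-system` volume at `/opt/workspace`, a file placed there from the JupyterLab container is visible to the master as well. A minimal sketch that checks a path is actually visible on the driver before handing it to `spark.read.json` (the helper name `driver_local_uri` is my own, not from the question):

```python
import os

def driver_local_uri(path):
    """Return a file:// URI for a path that exists on the driver's filesystem.

    Fails early with FileNotFoundError instead of letting Spark fail later
    with 'AnalysisException: Path does not exist: file:/...'.
    """
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{path} is not visible to the driver; "
            "copy it into the shared volume or into HDFS first"
        )
    return "file://" + os.path.abspath(path)
```

In this setup the call would succeed only after the gzipped JSON has actually been copied into `/opt/workspace` inside the shared volume, which is exactly what the error message says has not happened yet.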
How do I upload this file to the corresponding path in HDFS?
You can use
hdfs dfs -copyFromLocal /local/path/to.json /hdfs/path/to.json
to add files from local storage to HDFS.
Alternatively, prefix the path as file:///path/to/your.json
and check whether Spark can find it on your local file system.
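A related point: the keys returned by `get_matching_s3_keys` are bare S3 keys, so `spark.read.json` resolves them as driver-local paths relative to the working directory (`/opt/workspace`), which is what produces the `file:/opt/workspace/...` error. A sketch of the two ways to make the scheme explicit; the helper names are mine, and the `s3a://` variant assumes the hadoop-aws (s3a) connector and S3 credentials are configured on the cluster:

```python
def to_s3a_uri(bucket, key):
    """Read straight from S3 (needs the hadoop-aws s3a connector + credentials)."""
    return f"s3a://{bucket}/{key}"

def to_file_uri(path):
    """Read from the driver's local filesystem, with the file:// scheme stated explicitly."""
    return "file://" + path

# e.g. dfs = [spark.read.json(to_s3a_uri('cw-milenko-tests', k)) for k in new_list]
```

Either way, the path handed to Spark then carries an explicit scheme instead of defaulting to the local working directory.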