Spark on Linux Error: Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
I am working through Chapter 2 of Learning Spark, 2nd Edition. When I run the example mnmcount.py script, I get the following error:
21/02/08 11:40:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.io.IOException: Cannot run program "python": error=2, No such file or directory
The command I use to run the script is:
$SPARK_HOME/bin/spark-submit mnmcount.py data/mnm_dataset.csv
I am in the LearningSparkV2-master/chapter2/py/src directory.
In my .bashrc file I added the following lines and then sourced the file:
SPARK_HOME="/usr/local/spark"
alias python="python3"
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
The full code of the mnmcount.py script is below.
from __future__ import print_function

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import count

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: mnmcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = (SparkSession
             .builder
             .appName("PythonMnMCount")
             .getOrCreate())
    # get the M&M data set file name
    mnm_file = sys.argv[1]
    # read the file into a Spark DataFrame
    mnm_df = (spark.read.format("csv")
              .option("header", "true")
              .option("inferSchema", "true")
              .load(mnm_file))
    mnm_df.show(n=5, truncate=False)
    # aggregate count of all colors and groupBy state and color
    # orderBy descending order
    count_mnm_df = (mnm_df.select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .sum("Count")
                    .orderBy("sum(Count)", ascending=False))
    # show all the resulting aggregation for all the dates and colors
    count_mnm_df.show(n=60, truncate=False)
    print("Total Rows = %d" % (count_mnm_df.count()))
    # find the aggregate count for California by filtering
    ca_count_mnm_df = (mnm_df.select("*")
                       .where(mnm_df.State == 'CA')
                       .groupBy("State", "Color")
                       .sum("Count")
                       .orderBy("sum(Count)", ascending=False))
    # show the resulting aggregation for California
    ca_count_mnm_df.show(n=10, truncate=False)
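As an aside, the aggregation the script performs (group by State and Color, sum Count, sort descending) can be sketched in plain Python for readers without a Spark installation; the sample rows here are made up, not taken from mnm_dataset.csv:

```python
from collections import defaultdict

# Hypothetical rows in the same shape as mnm_dataset.csv: (State, Color, Count)
rows = [
    ("CA", "Red", 20), ("CA", "Red", 5),
    ("CA", "Blue", 10), ("TX", "Red", 7),
]

# equivalent of groupBy("State", "Color").sum("Count")
totals = defaultdict(int)
for state, color, n in rows:
    totals[(state, color)] += n

# equivalent of orderBy("sum(Count)", ascending=False)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # highest total first
```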
After adding

export PYSPARK_PYTHON=python3

to my .bashrc, the problem was solved.
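My reading of why this works: a shell alias such as `alias python="python3"` only applies to your interactive shell and is not inherited by the processes spark-submit spawns, whereas an exported environment variable is. A minimal sketch of the relevant .bashrc lines, using the same paths as the question:

```shell
# ~/.bashrc — environment for spark-submit (paths as given in the question)
export SPARK_HOME="/usr/local/spark"          # export it, not a plain assignment
export JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
# Tell PySpark which interpreter to launch; an alias is NOT seen by child processes.
export PYSPARK_PYTHON=python3
```

Note that the original .bashrc set SPARK_HOME without `export`, so it may not have been visible to subprocesses either.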
Try setting your master to "local[*]".
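If you want to try this without editing the script, the master can be passed on the spark-submit command line; a sketch using the same files as the question:

```shell
# Run the example on all local cores; --master overrides any default master URL.
$SPARK_HOME/bin/spark-submit --master "local[*]" mnmcount.py data/mnm_dataset.csv
```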