Load to BigQuery Via Spark Job Fails with an Exception for Multiple sources found for parquet

I have a Spark job on a Dataproc cluster that loads data into BigQuery. Here is the snippet:

df.write
      .format("bigquery")
      .mode(writeMode)
      .option("table",tabName)
      .save()

I specify the spark-bigquery dependency jar (spark-bigquery-with-dependencies_2.12-0.19.1.jar) in the --jars argument of the spark-submit command.
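For reference, the invocation looks roughly like this (the main class and uber jar name are placeholders, not from the original post):

```shell
# Sketch of the spark-submit call; the connector jar is supplied
# externally via --jars rather than bundled into the uber jar.
spark-submit \
  --class com.example.MyJob \
  --jars spark-bigquery-with-dependencies_2.12-0.19.1.jar \
  my-uber-jar.jar
```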

When I run the compiled code, I get the following exception: java.lang.RuntimeException: Failed to write to BigQuery

Detailed error:

Caused by: org.apache.spark.sql.AnalysisException: Multiple sources found for parquet (org.apache.spark.sql.execution.datasources.v2.parquet.ParquetDataSourceV2, org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat), please specify the fully qualified class name.
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:717)

These are the dependencies in my project:

<dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.12.14</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.8</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-bigquery</artifactId>
            <version>1.133.1</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud.spark</groupId>
            <artifactId>spark-bigquery_2.12</artifactId>
            <version>0.21.1</version>
        </dependency>
        <dependency>
            <groupId>com.google.cloud</groupId>
            <artifactId>google-cloud-storage</artifactId>
            <version>1.116.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
        </dependency>
        <dependency>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.3.3</version>
        </dependency>
    </dependencies>

I am building an uber jar to run the Spark job. If I remove the --jars argument, the job fails while reading the BigQuery table:

java.lang.ClassNotFoundException: Failed to find data source: bigquery. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:689)

It seems you are running on Spark 3.x with a jar that was compiled against, and bundles, Spark 2.4.8 artifacts, so two Parquet data source implementations end up on the classpath. The solution is simple: mark scala-library and spark-sql with scope provided so they are not shaded into the uber jar. Also, since you supply the spark-bigquery-connector externally via --jars, you do not need to declare it in your build (nor the google-cloud-* dependencies, unless you use them directly).
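Under those assumptions, a minimal sketch of the adjusted dependencies (the Spark version shown is illustrative; match whatever your Dataproc cluster actually runs):

```xml
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.12.14</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <!-- match the cluster's Spark version; 3.1.2 is a placeholder -->
        <version>3.1.2</version>
        <scope>provided</scope>
    </dependency>
    <!-- spark-bigquery and the google-cloud-* dependencies are removed:
         the connector is supplied at runtime via - -jars -->
</dependencies>
```

With provided scope, the Spark and Scala classes are available at compile time but excluded from the shaded jar, so the cluster's own Spark 3.x classes are the only ones on the classpath at runtime.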