How to use spark_read_avro from the sparklyr R package?
I am using:
R version 4.1.1
sparklyr version 1.7.2
I connect to my Databricks cluster with databricks-connect and try to read an Avro file with the following code:
library(sparklyr)
library(dplyr)
sc <- spark_connect(
method = "databricks",
spark_home = "my_spark_home_path",
version = "3.1.1",
packages = c("avro")
)
df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, memory = FALSE)
I also tried adding the package explicitly:
library(sparklyr)
library(dplyr)
sc <- spark_connect(
method = "databricks",
spark_home = "my_spark_home_path",
version = "3.1.1",
packages = "org.apache.spark:spark-avro_2.12:3.1.1"
)
df_path = "s3a://my_s3_path"
df = spark_read_avro(sc, path = df_path, memory = FALSE)
The Spark connection works and I can read Parquet files without problems, but whenever I try to read an Avro file I get:
Error in validate_spark_avro_pkg_version(sc) :
Avro support must be enabled with `spark_connect(..., version = <version>, packages = c("avro", <other package(s)>), ...)` or by explicitly including 'org.apache.spark:spark-avro_2.12:3.1.1-SNAPSHOT' for Spark version 3.1.1-SNAPSHOT in list of packages
Does anyone know how to fix this?
I found a workaround using the sparkavro package:
library(sparklyr)
library(dplyr)
library(sparkavro)
sc <- spark_connect(
method = "databricks",
spark_home = "my_spark_home_path")
df_path = "s3a://my_s3_path"
df = spark_read_avro(
sc,
path = df_path,
name = "my_table_name",
memory = FALSE)
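Both sparklyr and sparkavro export a function named spark_read_avro, so once library(sparkavro) is attached after sparklyr, the call above resolves to the sparkavro version and never reaches sparklyr's validate_spark_avro_pkg_version check. A minimal sketch that makes this explicit with a namespace prefix, assuming sparkavro is installed and the cluster can already read Avro; the wrapper name read_avro_via_sparkavro is hypothetical and only for illustration:

```r
# Hypothetical wrapper: calls sparkavro's reader by its full name so the
# choice of implementation does not depend on package load order.
read_avro_via_sparkavro <- function(sc, path, name) {
  sparkavro::spark_read_avro(
    sc,
    name = name,      # name under which the table is registered in Spark
    path = path,
    memory = FALSE    # do not cache the table in memory
  )
}

# Usage, given the connection `sc` and `df_path` from the snippet above:
# df <- read_avro_via_sparkavro(sc, df_path, "my_table_name")
```

With the prefix in place, library(sparkavro) is no longer needed, and sparklyr::spark_read_avro stays reachable under its own name.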