Reading a zst archive in Scala & Spark: native zStandard library not available
I am trying to read a zst-compressed file with Spark on Scala.
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("title", StringType, true)
  .add("selftext", StringType, true)
  .add("score", LongType, true)
  .add("created_utc", LongType, true)
  .add("subreddit", StringType, true)
  .add("author", StringType, true)

val df_with_schema = spark.read.schema(schema).json("/home/user/repos/concepts/abcde/RS_2019-09.zst")
df_with_schema.take(1)
Unfortunately, this produces the following error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
in stage 0.0 (TID 0) (192.168.0.101 executor driver):
java.lang.RuntimeException: native zStandard library not available:
this version of libhadoop was built without zstd support.
My hadoop checknative output looks as follows, but I understand from here that Apache Spark has its own ZStandardCodec.
Native library checking:
- hadoop: true /opt/hadoop/lib/native/libhadoop.so.1.0.0
- zlib: true /lib/x86_64-linux-gnu/libz.so.1
- zstd : true /lib/x86_64-linux-gnu/libzstd.so.1
- snappy: true /lib/x86_64-linux-gnu/libsnappy.so.1
- lz4: true revision:10301
- bzip2: true /lib/x86_64-linux-gnu/libbz2.so.1
- openssl: false EVP_CIPHER_CTX_cleanup
- ISA-L: false libhadoop was built without ISA-L support
- PMDK: false The native code was built without PMDK support.
Any ideas are much appreciated, thank you!
Update 1:
Based on this post, I now have a better understanding of what the message means: by default Hadoop is compiled without zstd enabled, so one possible solution is apparently to build it with that flag turned on.
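For reference, a rough sketch of what such a build might look like. The exact switch is my assumption based on Hadoop's BUILDING.txt (which documents a require.zstd option that fails the build when libzstd is missing), not something taken from the post:

# assumes a Hadoop source checkout and the usual native-build prerequisites
mvn clean package -Pdist,native -DskipTests -Dtar -Drequire.zstd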
Since I did not want to build Hadoop myself, inspired by the workaround used here, I configured Spark to use the Hadoop native libraries:
spark.driver.extraLibraryPath=/opt/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/hadoop/lib/native
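These look like spark-defaults.conf entries; since extraLibraryPath has to be in place before the driver JVM starts, the equivalent at launch time would be passing them as --conf options. A minimal sketch, assuming a plain spark-shell session:

spark-shell \
  --conf spark.driver.extraLibraryPath=/opt/hadoop/lib/native \
  --conf spark.executor.extraLibraryPath=/opt/hadoop/lib/native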
I can now read the zst archive into a DataFrame without any problems.
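For completeness, a quick way to confirm from inside the Spark session that the running JVM (as opposed to the separate hadoop checknative process) actually sees a zstd-capable libhadoop. This is only a sketch and assumes Hadoop's NativeCodeLoader is on the classpath, as it is in a standard Spark-with-Hadoop setup:

import org.apache.hadoop.util.NativeCodeLoader

// isNativeCodeLoaded reports whether libhadoop itself was found and loaded;
// buildSupportsZstd is a native method, so only call it once libhadoop is loaded.
if (NativeCodeLoader.isNativeCodeLoaded) {
  println(s"zstd supported by libhadoop: ${NativeCodeLoader.buildSupportsZstd}")
} else {
  println("libhadoop not loaded; check spark.{driver,executor}.extraLibraryPath")
}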