Writing a Parquet file from a CSV file using Apache Spark in Java
I want to convert a CSV file to Parquet using spark-csv.
Reading the file and saving it as a Dataset works. Unfortunately, I can't write it back out as a Parquet file. Is there any way to achieve this?
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example")
        .config("spark.master", "local")
        // forward slashes so the path compiles as a Java string literal on Windows
        .config("spark.sql.warehouse.dir", "file:///C:/spark_warehouse")
        .getOrCreate();

// read the CSV with spark-csv, inferring the schema from the header row
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
        .option("inferSchema", "true").option("header", "true").load("sample.csv");

// write the same data back out as Parquet
df.write().parquet("test.parquet");
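(As an aside, on Spark 2.x the bundled csv data source can be used instead of the external spark-csv package. A minimal sketch of the same conversion, assuming sample.csv has a header row:

Dataset<Row> df = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("sample.csv");
df.write().parquet("test.parquet");

This does not change the failure below, though, since the error occurs inside the Parquet writer, not the CSV reader.)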
Exception:
17/04/11 09:57:32 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NoSuchMethodError: org.apache.parquet.column.ParquetProperties.builder()Lorg/apache/parquet/column/ParquetProperties$Builder;
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:362)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:350)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon.newInstance(ParquetFileFormat.scala:145)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$$anonfun.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$$anonfun.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I fixed it with a workaround: I had to comment out the following two Parquet dependencies, though I'm not entirely sure why they get in each other's way:
<!-- <dependency> -->
<!-- <groupId>org.apache.parquet</groupId> -->
<!-- <artifactId>parquet-hadoop</artifactId> -->
<!-- <version>1.9.0</version> -->
<!-- </dependency> -->
<!-- <dependency> -->
<!-- <groupId>org.apache.parquet</groupId> -->
<!-- <artifactId>parquet-common</artifactId> -->
<!-- <version>1.9.0</version> -->
<!-- </dependency> -->
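The likely cause is a Parquet version conflict: Spark ships its own Parquet artifacts (the 1.8.x line for Spark 2.1), so declaring parquet-hadoop and parquet-common 1.9.0 explicitly can mix the newer ParquetOutputFormat with Spark's older parquet-column, which has no ParquetProperties.builder() method, hence the NoSuchMethodError. If the Parquet classes are needed at compile time, a sketch of an alternative to removing them entirely is to pin them to the version Spark actually brings in (check with mvn dependency:tree); the version property below is hypothetical:

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <!-- hypothetical property; set it to the Parquet version reported by mvn dependency:tree -->
    <version>${spark.parquet.version}</version>
</dependency>

Either way, the goal is that only one Parquet version ends up on the runtime classpath.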