Reading CSV using spark-csv package in spark-shell
I am trying to read a CSV file from AWS S3 in spark-shell using the spark-csv package.
Here are the steps I performed. I started spark-shell with the following command:
bin/spark-shell --packages com.databricks:spark-csv_2.10:1.2.0
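(Side note: the _2.10 suffix must match the Scala version of your Spark build. For a Spark distribution built against Scala 2.11, the launch would instead be:)
bin/spark-shell --packages com.databricks:spark-csv_2.11:1.2.0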
Then I executed the following Scala code in the shell:
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId", "****")
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", "****")
scala> val s3path = "s3n://bucket/sample.csv"
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(s3path)
This failed with the following error:
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
What am I missing here? Note that I am able to read the CSV using:
scala> sc.textFile(s3path)
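For example, pulling back a couple of raw lines confirms that the path and credentials resolve (a quick check; the exact output depends on the file contents):
scala> sc.textFile(s3path).take(2).foreach(println)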
The same Scala code also works fine in a Databricks notebook.
I have created an issue on the spark-csv GitHub repo; I'll update here when I get an answer for the issue.
For the URL s3n://bucket/sample.csv, Hadoop resolves the filesystem class and credentials from the fs.s3n.* keys (fs.<scheme>.impl and so on), so all of the s3n properties must be set; the code above only set the fs.s3.* variants. Setting the following properties let me read the CSV using spark-csv:
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3n.awsAccessKeyId", "****")
scala> hadoopConf.set("fs.s3n.awsSecretAccessKey", "****")