How to create a Dataset from a csv which doesn't have a header and has more than 150 columns using scala spark
I have a csv that I need to read as a Dataset. The csv has 140 columns and no header.
I created a schema with StructType(Seq(StructField(...), StructField(...), ...)).
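With 140 columns, writing every StructField out by hand is tedious, so here is a minimal sketch of generating the schema programmatically (the column names and the uniform StringType below are invented placeholders, not the real ones):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Sketch only: generates 140 StructFields in a loop. The names
// (col1..col140) and StringType are assumptions; the real names and
// types would come from the actual file spec.
val mySchema: StructType = StructType(
  (1 to 140).map(i => StructField(s"col$i", StringType, nullable = true))
)
```

The code to read the file is as follows: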
```scala
object dataParser {
  def getData(inputPath: String, delimiter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
    val parsedData: Dataset[MyCaseClass] = spark.read
      .option("header", "false")
      .option("delimiter", delimiter) // the option key is "delimiter"; pass the parameter, not a string literal
      .schema(mySchema)               // with an explicit schema, inferSchema is unnecessary
      .csv(inputPath)                 // use .csv(...); a bare .load(...) defaults to parquet
      .as[MyCaseClass]
    parsedData
  }
}
```
The case class I created looks like this:
```scala
case class MyCaseClass(
  mycaseClass1: MyCaseClass1,
  mycaseClass2: MyCaseClass2,
  mycaseClass3: MyCaseClass3,
  mycaseClass4: MyCaseClass4,
  mycaseClass5: MyCaseClass5,
  mycaseClass6: MyCaseClass6,
  mycaseClass7: MyCaseClass7
)

case class MyCaseClass1(
  // first 20 columns of the csv and their datatypes
)

case class MyCaseClass2(
  // next 20 columns of the csv and their datatypes
)
```
And so on.
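For concreteness, a hypothetical version of one of these groups might look like the following (the real column names and types are not shown above, so these fields are invented):

```scala
// Hypothetical illustration only: the field names and types are made up.
case class MyCaseClass1(
  customerId: String,
  accountBalance: Double
  // ... 18 more fields covering the first 20 csv columns
)
```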
But when I try to compile it, it gives me the following error:
```
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .as[myCaseClass]
```
I am calling this in my Scala application like this:
```scala
object MyTestApp {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate() // conf is defined elsewhere
    import spark.implicits._
    run(args)
  }

  def run(args: Array[String])(implicit spark: SparkSession): Unit = {
    val inputPath = args(0) // Array[String] has no .get(key), so take the path positionally
    val delimiter = Constants.delimeter
    val myData = dataParser.getData(inputPath, delimiter)
  }
}
```
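As an aside, since args.get("inputData") does not compile on an Array[String], a slightly more defensive version of the argument handling could look like this (a sketch; the usage message is invented):

```scala
// Hypothetical sketch: fail fast with a usage hint when the path is missing.
val inputPath: String = args.headOption.getOrElse {
  sys.error("usage: MyTestApp <inputPath>")
}
```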
I'm also not very sure about the approach, as I'm new to Datasets.
I saw multiple answers around this issue, but they were mostly for a very small number of columns that fit inside a single case class, and for csvs with a header, which makes things a little simpler.
Any help would be really appreciated.
Thanks to everyone who looked at this. I actually found the problem. Posting the answer here so that anyone else who runs into this kind of issue can get past it.
I needed to import spark.implicits._ here:
```scala
object dataParser {
  def getData(inputPath: String, delimiter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
    import spark.implicits._ // <-- the missing piece: brings Encoder[MyCaseClass] into scope
    val parsedData: Dataset[MyCaseClass] = spark.read
      .option("header", "false")
      .option("delimiter", delimiter)
      .schema(mySchema)
      .csv(inputPath)
      .as[MyCaseClass]
    parsedData
  }
}
```

The import in main does not help, because the implicit Encoder[MyCaseClass] has to be in scope where .as[MyCaseClass] is compiled, i.e. inside getData itself.
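One caveat beyond the encoder import that readers with the same setup may hit: Spark's CSV source works with flat schemas, so a 140-column file is read as 140 top-level columns, while .as[MyCaseClass] with nested case classes expects seven struct columns named after the outer fields. A minimal sketch of grouping the flat columns into structs first, assuming hypothetical flat column names c1..c140:

```scala
import org.apache.spark.sql.functions.{col, struct}

// Sketch only: the flat column names (c1..c140) are assumptions, since
// the real ones aren't shown. Each struct's inner field names must match
// the fields of the matching inner case class (MyCaseClass1, ...), and
// spark.implicits._ must be in scope for .as[MyCaseClass].
val nested = flatData.select(
  struct((1 to 20).map(i => col(s"c$i")): _*).as("mycaseClass1"),
  struct((21 to 40).map(i => col(s"c$i")): _*).as("mycaseClass2")
  // ... and so on for the remaining five groups
)
val typed = nested.as[MyCaseClass]
```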