Cannot up cast `ordId` from string to int as it may truncate
I am trying to read a small file as a Dataset, but I get the error
"Cannot up cast `ordId` from string to int as it may truncate".
The code:
object Main {
  import org.apache.spark.sql.{Encoders, SparkSession}

  case class Orders(ordId: Int, custId: Int, amount: Float, date: String)

  def main(args: Array[String]): Unit = {
    val schema = Encoders.product[Orders].schema
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("")
      .getOrCreate()
    val df = spark.read.option("header", true).csv("/mnt/data/orders.txt")
    import spark.implicits._
    val ds = df.as[Orders] // fails: CSV columns are all read as string
  }
}
orders.txt
ordId,custId,amount,date
1234,123,400,20190112
2345,456,600,20190122
1345,123,500,20190123
3456,345,800,20190202
5678,123,600,20190203
6578,455,900,20190301
How can I fix this error? Also, do I need to read the file as a DataFrame first and then convert it to a Dataset?
Try passing the schema (using `.schema`) while reading the DataFrame:
import org.apache.spark.sql.Encoders
val schema = Encoders.product[Orders].schema
val ds = spark.read.option("header", true).schema(schema).csv("/mnt/data/orders.txt").as[Orders]
ds.show()
Result:
+-----+------+------+--------+
|ordId|custId|amount| date|
+-----+------+------+--------+
| 1234| 123| 400.0|20190112|
| 2345| 456| 600.0|20190122|
| 1345| 123| 500.0|20190123|
| 3456| 345| 800.0|20190202|
| 5678| 123| 600.0|20190203|
| 6578| 455| 900.0|20190301|
+-----+------+------+--------+
Schema:
ds.printSchema()
root
|-- ordId: integer (nullable = true)
|-- custId: integer (nullable = true)
|-- amount: float (nullable = true)
|-- date: string (nullable = true)
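As an alternative sketch (not part of the original answer), the inferred string columns could also be cast explicitly after reading, then converted with `.as[Orders]`; this avoids declaring a schema up front at the cost of one cast per column. The name `ds2` is illustrative, and the same `spark` session and `Orders` case class as above are assumed:

```scala
// Alternative sketch: cast the inferred string columns explicitly.
import org.apache.spark.sql.types.{FloatType, IntegerType}
import spark.implicits._

val ds2 = spark.read.option("header", true).csv("/mnt/data/orders.txt")
  .withColumn("ordId", $"ordId".cast(IntegerType))
  .withColumn("custId", $"custId".cast(IntegerType))
  .withColumn("amount", $"amount".cast(FloatType))
  .as[Orders] // now succeeds: column types match the case class
```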
Update:
There are multiple ways to extract the month from the date column:
- Using the unix_timestamp and from_unixtime functions:
import org.apache.spark.sql.functions._
ds.withColumn("mnth", from_unixtime(unix_timestamp($"date", "yyyyMMdd"), "MMM")).show()
(or)
- Using the to_date and date_format functions:
ds.withColumn("mnth", date_format(to_date($"date", "yyyyMMdd"), "MMM")).show()
Result:
+-----+------+------+--------+----+
|ordId|custId|amount| date|mnth|
+-----+------+------+--------+----+
| 1234| 123| 400.0|20190112| Jan|
| 2345| 456| 600.0|20190122| Jan|
| 1345| 123| 500.0|20190123| Jan|
| 3456| 345| 800.0|20190202| Feb|
| 5678| 123| 600.0|20190203| Feb|
| 6578| 455| 900.0|20190301| Mar|
+-----+------+------+--------+----+
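If a numeric month is wanted instead of the three-letter abbreviation, the built-in `month` function is another option; this is a sketch along the same lines as the update above, and `mnthNum` is an illustrative column name:

```scala
import org.apache.spark.sql.functions.{month, to_date}
import spark.implicits._

// month() returns the month of a date column as an integer (1-12)
ds.withColumn("mnthNum", month(to_date($"date", "yyyyMMdd"))).show()
```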