Why does getInt inside RDD[Row].map give "error: value getInt is not a member of Any"?
I'm new to Scala and Spark, but I need them for my bachelor's degree final project.
I'm trying to build a K-means model on top of the data.
The data comes from Kaggle: https://www.kaggle.com/murderaccountability/homicide-reports
I read the file containing the data and created a case class for it:
case class CrimeReport (Record_ID: String, Agency_Name: String,
City: String, State: String, Year: Int, Month: Int, Crime_Type: String,
Crime_Solved: String, Victim_Sex: String, Victim_Age: Int, Victim_Race: String,
Perpetrator_Sex: String, Perpetrator_Age: String, Perpetrator_Race: String, Relationship: String, Victim_Count: String)
I map my data onto the case class. For example, Month comes in as a String but I need an Int (to build my feature vector later on), so I defined a function to parse it:
// Parse Month: String ===> Int
def parseMonthToNumber(month: String): Int = month match {
  case "January"   => 1
  case "February"  => 2
  case "March"     => 3
  case "April"     => 4
  case "May"       => 5
  case "June"      => 6
  case "July"      => 7
  case "August"    => 8
  case "September" => 9
  case "October"   => 10
  case "November"  => 11
  case _           => 12 // "December" and any unexpected value
}
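As a side note, the same mapping can be written as a table lookup; this is only an alternative sketch (the monthNumbers name is mine), with the same fallback of 12:

// Alternative sketch: table lookup with the same default of 12
val monthNumbers = Map(
  "January"   -> 1, "February" -> 2,  "March"    -> 3,  "April"  -> 4,
  "May"       -> 5, "June"     -> 6,  "July"     -> 7,  "August" -> 8,
  "September" -> 9, "October"  -> 10, "November" -> 11
).withDefaultValue(12)  // "December" and anything unexpected map to 12

def parseMonth(month: String): Int = monthNumbers(month)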
val data = sc.textFile("... .csv")
val data_split = data.map(line => line.split(","))
val allData = data_split.map(p => CrimeReport(p(0).toString,
p(1).toString, p(2).toString, p(3).toString, parseInt(p(4)),
parseMonthToNumber(p(5)), p(6).toString, p(7).toString, p(8).toString,
parseInt(p(9)), p(10).toString, p(11).toString, p(12).toString,
p(13).toString, p(14).toString, p(15).toString))
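One caveat here (my note, not part of the original code): split(",") also runs over the CSV header line, where parseInt(p(4)) would blow up, and quoted fields containing commas get split apart. A minimal sketch that replaces the split above and at least skips the header:

val header = data.first()                                    // the CSV header line
val data_split = data.filter(_ != header).map(_.split(","))  // keep only data lines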
//DataFrame
val allDF = allData.toDF()
//convert data to RDD which will be passed to KMeans
val rowsRDD = allDF.rdd.map( x =>
(x(0).getString, x.getString(1), x.getString(2), x.getString(3), x(4).getInt, x(5).getInt, x.getString(6), x.getString(7), x.getString(8), x(9).getInt, x.getString(10), x.getString(11), x.getString(12), x.getString(13), x.getString(14), x.getString(15))
)
But I get this error:
error: value getInt is not a member of Any
(x(0).getString, x.getString(1), x.getString(2), x.getString(3), x(4).getInt, x(5).getInt, x.getString(6), x.getString(7), x.getString(8), x(9).getInt, x.getString(10), x.getString(11), x.getString(12), x.getString(13), x.getString(14), x.getString(15))
^
Why?
I'm assuming the latest version, Spark 2.1.1.
Let me start with a question, since there is a DataFrame-based KMeans implementation in Spark: why convert the DataFrame to an RDD[Row] at all to run KMeans? Read up on KMeans in Spark MLlib.
I would not do that, since Spark MLlib's RDD-based API is deprecated:
This page documents sections of the MLlib guide for the RDD-based API (the spark.mllib package). Please see the MLlib Main Guide for the DataFrame-based API (the spark.ml package), which is now the primary API for MLlib.
That said, let's look at the problem you ran into.
If I were you (and disregarded the advice to stick with Spark MLlib's DataFrame-based API), I would do the following:
// val allDF = allData.toDF()
val allDF = allData.toDS
With the above you get a Dataset[CrimeReport], which is much more pleasant to work with than bare Rows.
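Note that toDS needs the implicit encoders in scope. A minimal sketch, assuming a SparkSession named spark (spark-shell gives you both the session and the implicits):

import spark.implicits._  // provides the encoder for the CrimeReport case class
val allDF = allData.toDS  // Dataset[CrimeReport]
allDF.printSchema()       // column names and types come from the case class fields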
Once the conversion is done, you can write:
val rowsRDD = allDF.rdd.map { x => ... }
where x is of your type CrimeReport, and I'm sure you will know what to do with it.
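For example (a hypothetical pick of a field, just to show the typed access):

val ages = allDF.rdd.map { x => x.Victim_Age }  // x is a CrimeReport, so this is an RDD[Int]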
To answer your question directly, the reason for the error:
error: value getInt is not a member of Any
is that x(5) (and the others) is of type Any, so you either have to cast it to the type you expect, or simply replace x(5).getInt with x.getInt(5).
See the scaladoc of Row.
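In other words, with the original allData.toDF() (where the elements really are Rows), both forms compile; a minimal sketch:

allData.toDF().rdd.map { x =>
  val year  = x.getInt(4)             // Row's typed getter
  val year2 = x(4).asInstanceOf[Int]  // x(4) returns Any, so cast it explicitly
  (year, year2)
}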
How can we use KMeans when the case class holds String fields rather than doubles? This code of mine won't work, because the vector needs double values:
// Passing in Crime_Type, Crime_Solved, Perpetrator_Race to KMeans as
// the attributes we want to use to assign the instance to a cluster.
val vectors = allDF.rdd.map(r => Vectors.dense( r.Crime_Type, r.Crime_Solved, r.Perpetrator_Race ))
//KMeans model with 2 clusters and 10 iterations
val kMeansModel = KMeans.train(vectors, 2, 10)
You should define the attributes you want to pass to Vectors.dense as Int/Double. Then, when you map the data from the file onto the case class, call the parsing functions you defined earlier, as you can see here:
val data_split = data.map(line => line.split(","))
val allData = data_split.map(p => CrimeReport(p(0).toString,
  p(1).toString, p(2).toString, p(3).toString, parseInt(p(4)),
  parseMonthToNumber(p(5)), p(6).toString, parseSolved(p(7)),
  parseSex(p(8)), parseInt(p(9)), parseRaceToNumber(p(10)),
  p(11).toString, p(12).toString, p(13).toString, p(14).toString,
  p(15).toString))
The functions are:
// Filter and clean data => Crime_Solved
def parseSolved(solved: String): Int = solved match {
  case "Yes" => 1
  case _     => 0
}
Or:
// Parse Victim_Race: String ===> Int
def parseRaceToNumber(crType: String): Int = {
  val race = crType.split("/")
  race(0) match {
    case "White"           => 1
    case "Black"           => 2
    case "Asian"           => 3
    case "Native American" => 4
    case _                 => 0
  }
}
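With those numeric fields in place (and the corresponding case class fields changed from String to Int, an assumption this sketch makes explicit), the vectors and the training call from your snippet line up:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans

// Sketch: assumes Crime_Solved and Victim_Race are now Int in CrimeReport
val vectors = allData.map(r =>
  Vectors.dense(r.Crime_Solved.toDouble, r.Victim_Race.toDouble, r.Victim_Age.toDouble))
vectors.cache()  // KMeans makes several passes over the data

// KMeans model with 2 clusters and 10 iterations
val kMeansModel = KMeans.train(vectors, 2, 10)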