Spark - How to create a Spark dataframe that contains array of values in one of its columns for countVectorizer model
I am trying to run Spark's CountVectorizer model. As part of this requirement, I am reading a CSV file and creating a DataFrame (inp_DF) from it.
It has 3 columns, as shown below:
+--------------+--------+-------+
| State|Zip Code|Country|
+--------------+--------+-------+
| kentucky| 40205| us|
| indiana| 47305| us|
|greater london| sw15| gb|
| california| 92707| us|
| victoria| 3000| au|
| paris| 75001| fr|
| illinois| 60608| us|
| minnesota| 55405| us|
| california| 92688| us|
+--------------+--------+-------+
I need to create a 4th column in the same dataframe that contains an array of the values from all 3 of these columns, for example:
| kentucky| 40205| us| "kentucky","40205","us"
| indiana| 47305| us| "indiana","47305","us"
|greater london| sw15| gb| "greater london","sw15","gb"
| california| 92707| us| "california","92707","us"
| victoria| 3000| au| "victoria","3000","au"
| paris| 75001| fr| "paris","75001","fr"
| illinois| 60608| us| "illinois","60608","us"
| minnesota| 55405| us| "minnesota","55405","us"
| california| 92688| us| "california","92688","us"
Question 1: Is there a simple command like .concat to achieve this?
This array is needed because the input to the CountVectorizer model must be a column containing an array of values. It must not be of string data type, as the following error message indicates:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column State must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
	at org.apache.spark.ml.feature.CountVectorizerParams$class.validateAndTransformSchema(CountVectorizer.scala:75)
	at org.apache.spark.ml.feature.CountVectorizer.validateAndTransformSchema(CountVectorizer.scala:123)
	at org.apache.spark.ml.feature.CountVectorizer.transformSchema(CountVectorizer.scala:188)
	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
	at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:155)
	at org.apache.spark.examples.ml.CountVectorizerExample$.main(CountVectorizerExample.scala:54)
	at org.apache.spark.examples.ml.CountVectorizerExample.main(CountVectorizerExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=300m; support was removed in 8.0
I tried to create an array from these 3 input dataframe columns, but the array elements come enclosed in square brackets [ ].
A sample code snippet is given below for reference:
// Read Input Dataset for countVectorizer Logic
val inp_data = spark.read.format("com.databricks.spark.csv").option("header", "True").option("inferSchema", "true")
.option("treatEmptyValuesAsNulls", "true").option("nullValue", "")
.load("Input.csv")
// Creating a Spark Dataframe from the Input Data
val inp_DF = inp_data.toDF()
// Creating an array from Spark Dataframe Columns
// Note: collect() returns an Array[Row], and Row.toString prints its
// values wrapped in [ ] - which is why the square brackets appear
val inp_array = inp_DF.select("State","Zip Code","Country").collect()
println(inp_array.mkString(","))
// fit a CountVectorizerModel from the corpus
// This is the failing call: "State" is StringType, which triggers the
// IllegalArgumentException shown above
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("State")
.setOutputCol("features")
.setVocabSize(4)
.setMinDF(2)
.fit(inp_DF)
Question 2: How do I remove the square brackets [ ] from these array elements and create a new column in the dataframe using the array's values?
Question 3: Can we provide a single column's values as input to the countVectorizer model and get features as output?
You can use the array function to create the array column as
import org.apache.spark.sql.functions._
val inp_array = inp_DF.withColumn("arrayColumn", array("State", "Zip Code", "Country"))
This should give you the output:
+-------------+--------+-------+-------------------------+
|State |Zip Code|Country|arrayColumn |
+-------------+--------+-------+-------------------------+
|kentucky |40205 |us |[kentucky, 40205, us] |
|indiana |47305 |us |[indiana, 47305, us] |
|greaterlondon|sw15 |gb |[greaterlondon, sw15, gb]|
|california |92707 |us |[california, 92707, us] |
|victoria |3000 |au |[victoria, 3000, au] |
|paris |75001 |fr |[paris, 75001, fr] |
|illinois |60608 |us |[illinois, 60608, us] |
|minnesota |55405 |us |[minnesota, 55405, us] |
|california |92688 |us |[california, 92688, us] |
+-------------+--------+-------+-------------------------+
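For contrast with the .concat idea from Question 1: concatenating would collapse each row into a single string token, while array keeps the values as separate tokens, which is what CountVectorizer counts over. A minimal plain-Scala illustration of the difference (no Spark needed, values taken from the first sample row):

```scala
// One row's values
val values = Seq("kentucky", "40205", "us")

// concat-style: a single combined string, i.e. one token -
// this would still fail CountVectorizer's ArrayType requirement
val concatenated = values.mkString(",")

// array-style: the values stay separate tokens, matching what
// array("State", "Zip Code", "Country") produces per row
val tokens = values
```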
And you can use this dataframe in the CountVectorizerModel as
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("arrayColumn")
.setOutputCol("features")
.setVocabSize(4)
.setMinDF(2)
.fit(inp_array)
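To make it concrete what fit then does with that array column, here is a hypothetical plain-Scala sketch (not Spark's actual implementation) of the counting logic: the vocabulary is built from tokens whose document frequency meets minDF, and each row's array becomes counts over that vocabulary.

```scala
// Toy corpus: each element plays the role of one row's arrayColumn
val corpus = Seq(
  Seq("kentucky", "40205", "us"),
  Seq("indiana", "47305", "us"),
  Seq("california", "92707", "us")
)

// Document frequency: in how many rows each token appears at least once
val docFreq = corpus
  .flatMap(_.distinct)
  .groupBy(identity)
  .map { case (tok, occurrences) => tok -> occurrences.size }

// minDF = 2 keeps only tokens present in at least 2 rows (here just "us")
val vocabulary = docFreq.filter { case (_, df) => df >= 2 }.keys.toSeq.sorted

// Each row becomes a vector of counts over the vocabulary
val features = corpus.map(row => vocabulary.map(tok => row.count(_ == tok)))
```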
This answers your first two questions.
Now for your third question: YES, you can use just one column in the CountVectorizerModel, but to do so you need to convert that column to ArrayType(StringType,true), which can be done with the array function as above.
Suppose you want to use the State column in the CountVectorizerModel. Then you can change the data type of the State column to array with
val single_arrayDF = inp_DF.withColumn("State", array("State"))
and use it as
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("State")
.setOutputCol("features")
.setVocabSize(4)
.setMinDF(2)
.fit(single_arrayDF)
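As a sanity check on what the single-column variant learns: once State is wrapped with array, each row's "document" is a one-token array, so the fitted vocabulary is essentially the distinct State values (subject to vocabSize and minDF). A plain-Scala sketch of that effect, using hypothetical sample values:

```scala
val states = Seq("kentucky", "indiana", "california", "california")

// Analogue of array("State"): each value becomes a one-token document
val docs = states.map(s => Seq(s))

// The candidate vocabulary is just the distinct column values
val vocab = docs.flatten.distinct.sorted
```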
I hope the answer is helpful.