Checking whether a list of strings in a Scala DataFrame column is present in the values of a Map
I have the following data:
val df = Seq(
(1, List("A")),
(2, List("A")),
(3, List("A", "B")),
(4, List("C")),
(5, List("A")),
(6, List("A", "C")),
(7, List("B")),
(8, List("A", "B", "C")),
(9, List("A"))
).toDF("Serial Number", "my_list")
+-------------+---------+
|Serial Number|  my_list|
+-------------+---------+
|            1|      [A]|
|            2|      [A]|
|            3|   [A, B]|
|            4|      [C]|
|            5|      [A]|
|            6|   [A, C]|
|            7|      [B]|
|            8|[A, B, C]|
|            9|      [A]|
+-------------+---------+
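I also have a Map:
val category_Mapping = Map("Category1" -> List("A", "B"),
  "Category2" -> List("C"),
  "Category3" -> List("B", "D"))
I want to look up each list element in the my_list column and return an output map for each Serial Number, as follows: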
+-------------+---------+------------------------------------------+
|Serial Number|  my_list|                                    output|
+-------------+---------+------------------------------------------+
|            1|      [A]|{Category1->1, Category2->0, Category3->0}|
|            2|      [A]|{Category1->1, Category2->0, Category3->0}|
|            3|   [A, B]|{Category1->1, Category2->0, Category3->1}|
|            4|      [C]|{Category1->0, Category2->1, Category3->0}|
|            5|      [A]|{Category1->1, Category2->0, Category3->0}|
|            6|   [A, C]|{Category1->1, Category2->1, Category3->0}|
|            7|      [B]|{Category1->1, Category2->0, Category3->1}|
|            8|[A, B, C]|{Category1->1, Category2->1, Category3->1}|
|            9|      [A]|{Category1->1, Category2->0, Category3->0}|
+-------------+---------+------------------------------------------+
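Basically, I want to return an output map in which the value for a category is 1 if any element of the list in my_list appears among that category's values in category_Mapping, and 0 otherwise. Is there any way I can do this?
Edit: It has been about 5 hours and nobody has answered. Could someone help me with this?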
You can try this.
I ran it in Spark local mode, not on a cluster.
// Assuming that your DataFrame is stored in a variable called df
import scala.collection.mutable
import org.apache.spark.sql.functions.{col, udf}

// Define a function which returns your map based on the given array in the column 'my_list'.
// Spark hands an array column to a Scala UDF as a Seq, so Seq[String] is used as the parameter type.
def function(lst: Seq[String]): Map[String, Int] = {
  // The category lookup, built once per call instead of once per comparison
  val categories = Map("Category1" -> Array("A", "B"), "Category2" -> Array("C"), "Category3" -> Array("B", "D"))
  // Every category starts at 0 and is set to 1 as soon as one element of lst matches it
  val map = mutable.Map("Category1" -> 0, "Category2" -> 0, "Category3" -> 0)
  lst.foreach { l =>
    map.keys.foreach { key =>
      if (categories(key).contains(l))
        map(key) = 1
    }
  }
  map.toMap
}

// Now you can define a udf which just calls the function defined above
val output = udf { (lst: Seq[String]) => function(lst) }

// Now you can call the udf on the column 'my_list'
df.withColumn("output", output(col("my_list"))).show(false)
// The output will be as given below
+-------------+---------+------------------------------------------------+
|Serial Number|my_list |output |
+-------------+---------+------------------------------------------------+
|1 |[A] |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
|2 |[A] |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
|3 |[A, B] |[Category2 -> 0, Category1 -> 1, Category3 -> 1]|
|4 |[C] |[Category2 -> 1, Category1 -> 0, Category3 -> 0]|
|5 |[A] |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
|6 |[A, C] |[Category2 -> 1, Category1 -> 1, Category3 -> 0]|
|7 |[B] |[Category2 -> 0, Category1 -> 1, Category3 -> 1]|
|8 |[A, B, C]|[Category2 -> 1, Category1 -> 1, Category3 -> 1]|
|9 |[A] |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
+-------------+---------+------------------------------------------------+
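The per-row logic can also be checked in plain Scala, without a Spark session. The following is only a sketch: the helper name categoryFlags is hypothetical (it does not appear in the original answer), and the mapping is the one from the question written with Seq values.

```scala
// Hypothetical helper: for every category, emit 1 if any element of lst
// is one of that category's members, and 0 otherwise.
def categoryFlags(lst: Seq[String], mapping: Map[String, Seq[String]]): Map[String, Int] =
  mapping.map { case (category, members) =>
    category -> (if (lst.exists(x => members.contains(x))) 1 else 0)
  }

// The mapping from the question
val category_Mapping = Map(
  "Category1" -> Seq("A", "B"),
  "Category2" -> Seq("C"),
  "Category3" -> Seq("B", "D")
)

// Row 6 of the sample data
println(categoryFlags(Seq("A", "C"), category_Mapping))
```

Because the result is built from the mapping rather than from the list, every category key is present even when the list is empty.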
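To derive the keys of the output map from category_Mapping, we can pass the category_Mapping variable to the udf as a parameter and use it inside the function to build the output map dynamically. This can be done as follows: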
val category_Mapping = Map("Category1" -> Array("A", "B"), "Category2" -> Array("C"), "Category3" -> Array("B", "D"))

def function(lst: Seq[String], category_Mapping: Map[String, Array[String]]): Map[String, Int] = {
  // Start every category at 0 up front, so that an empty list still yields all keys
  val map = scala.collection.mutable.Map(category_Mapping.keys.toSeq.map(key => key -> 0): _*)
  lst.foreach { l =>
    category_Mapping.keys.foreach { key =>
      if (category_Mapping(key).contains(l))
        map(key) = 1
    }
  }
  map.toMap
}

// The definition of the udf changes in this case: it is now a method that takes the mapping
// and returns a udf which closes over it
def output(category_Mapping: Map[String, Array[String]]) = udf { (lst: Seq[String]) =>
  function(lst, category_Mapping)
}

df.withColumn("output", output(category_Mapping)(col("my_list"))).show(false)
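As a possible alternative, the same column can be built without a UDF, using Spark's built-in arrays_overlap and map functions (available since Spark 2.4). This is a sketch under assumptions, not part of the original answer: the local SparkSession and the three-row df exist only so that the snippet runs on its own, and the lists are assumed to contain no null elements (arrays_overlap can return null otherwise).

```scala
import org.apache.spark.sql.{Column, SparkSession, functions => F}

// Assumption: a local session, only so that the sketch is self-contained
val spark = SparkSession.builder().master("local[*]").appName("category-flags").getOrCreate()
import spark.implicits._

// A small subset of the sample data from the question
val df = Seq((3, List("A", "B")), (4, List("C")), (7, List("B"))).toDF("Serial Number", "my_list")

val category_Mapping = Map("Category1" -> Array("A", "B"), "Category2" -> Array("C"), "Category3" -> Array("B", "D"))

// functions.map expects alternating key and value columns:
// the key is a string literal, the value is the overlap test cast to 0 or 1
val entries: Seq[Column] = category_Mapping.toSeq.flatMap { case (key, values) =>
  val candidates = F.array(values.map(v => F.lit(v)): _*)
  Seq(F.lit(key), F.arrays_overlap(F.col("my_list"), candidates).cast("int"))
}

val result = df.withColumn("output", F.map(entries: _*))
result.show(false)
```

Keeping the computation in built-in column expressions lets Spark optimize the plan instead of treating the UDF as a black box, at the cost of slightly less readable code.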