Checking if a list of strings in a Scala DataFrame column is present in the values of a Map

I have the following data:

val df = Seq(
    (1, List("A")),
    (2, List("A")), 
    (3, List("A", "B")),
    (4, List("C")),
    (5, List("A")),
    (6, List("A", "C")),
    (7, List("B")),
    (8, List("A", "B", "C")),
    (9, List("A"))
  ).toDF("Serial Number", "my_list")

+--------------------+--------------------+
|       Serial Number|             my_list|
+--------------------+--------------------+
|                   1|                 [A]|
|                   2|                 [A]|
|                   3|               [A,B]|
|                   4|                 [C]|
|                   5|                 [A]|
|                   6|              [A, C]|
|                   7|                 [B]|
|                   8|           [A, B, C]|
|                   9|                 [A]|
+--------------------+--------------------+

I have a Map:

val category_Mapping = Map("Category1" -> List("A", "B"),
                  "Category2" -> List("C"),
                  "Category3" -> List("B", "D"))

For each row, I want to look up every element of the list in data["my_list"] and return an output map for each data["Serial Number"], as follows:

+--------------------+--------------------+------------------------------------------+
|       Serial Number|             my_list|                                   output |
+--------------------+--------------------+------------------------------------------+
|                   1|                 [A]|{Category1->1, Category2->0, Category3->0}|
|                   2|                 [A]|{Category1->1, Category2->0, Category3->0}|
|                   3|               [A,B]|{Category1->1, Category2->0, Category3->1}|
|                   4|                 [C]|{Category1->0, Category2->1, Category3->0}|
|                   5|                 [A]|{Category1->1, Category2->0, Category3->0}|
|                   6|              [A, C]|{Category1->1, Category2->1, Category3->0}|
|                   7|                 [B]|{Category1->1, Category2->0, Category3->1}|
|                   8|           [A, B, C]|{Category1->1, Category2->1, Category3->1}|
|                   9|                 [A]|{Category1->1, Category2->0, Category3->0}|
+--------------------+--------------------+------------------------------------------+

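Basically, I want to return an output map in which a category's value is 1 if any element of the list in data["my_list"] appears in that category's values in category_Mapping, and 0 otherwise. Is there any way I can do this?

Edit: It has been about 5 hours and nobody has answered. Can someone help me with this?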

You can try this.
I did this in Spark local mode rather than on a cluster.

// Assuming that your dataframe is stored in a variable called df

import scala.collection.mutable
import org.apache.spark.sql.functions.{col, udf}

// Define a function which will return your map based on the given array in the column 'my_list'

def function(lst: mutable.WrappedArray[String]): Map[String, Int] = {
    // Category -> members lookup, built once per call instead of once per element
    val categories = Map("Category1" -> Array("A", "B"), "Category2" -> Array("C"), "Category3" -> Array("B", "D"))
    // Every category starts at 0 and is set to 1 as soon as one list element belongs to it
    val map = mutable.Map("Category1" -> 0, "Category2" -> 0, "Category3" -> 0)
    lst.foreach { l =>
      categories.keys.foreach { key =>
        if (categories(key).contains(l))
            map(key) = 1
      }
    }
    map.toMap
}

// now you can define a udf which will just call the above defined function

val output = udf { (lst: mutable.WrappedArray[String]) => {
    function(lst)
  }
}

// now you can call the udf on the column 'my_list'

df.withColumn("output", output(col("my_list"))).show(false)

// The output will be as given below

+-------------+---------+------------------------------------------------+
|Serial Number|my_list  |output                                          |
+-------------+---------+------------------------------------------------+
|1            |[A]      |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
|2            |[A]      |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
|3            |[A, B]   |[Category2 -> 0, Category1 -> 1, Category3 -> 1]|
|4            |[C]      |[Category2 -> 1, Category1 -> 0, Category3 -> 0]|
|5            |[A]      |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
|6            |[A, C]   |[Category2 -> 1, Category1 -> 1, Category3 -> 0]|
|7            |[B]      |[Category2 -> 0, Category1 -> 1, Category3 -> 1]|
|8            |[A, B, C]|[Category2 -> 1, Category1 -> 1, Category3 -> 1]|
|9            |[A]      |[Category2 -> 0, Category1 -> 1, Category3 -> 0]|
+-------------+---------+------------------------------------------------+
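
The per-row logic can be sanity-checked without Spark. The following is a minimal sketch in plain Scala, assuming the same hard-coded categories as above; `CategoryFlags`, `categories` and `flagsFor` are hypothetical names used only for illustration and are not part of the answer's code.

```scala
object CategoryFlags {
  // Hypothetical lookup mirroring the hard-coded categories used in `function`.
  val categories: Map[String, Seq[String]] = Map(
    "Category1" -> Seq("A", "B"),
    "Category2" -> Seq("C"),
    "Category3" -> Seq("B", "D")
  )

  // A category maps to 1 if at least one element of `lst` is among its members, else 0.
  def flagsFor(lst: Seq[String]): Map[String, Int] =
    categories.map { case (name, members) =>
      name -> (if (lst.exists(l => members.contains(l))) 1 else 0)
    }

  def main(args: Array[String]): Unit = {
    println(flagsFor(Seq("B")))      // same list as row 7 of the sample data
    println(flagsFor(Seq("A", "C"))) // same list as row 6 of the sample data
  }
}
```

For the list in row 7, `flagsFor(Seq("B"))` gives 1 for Category1 and Category3 and 0 for Category2, which agrees with the table above.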

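To derive the keys of the output map from category_Mapping instead of hard-coding them, we can pass the category_Mapping variable as a parameter to the udf and use it inside the function to build the output map dynamically. This can be done as follows: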

val category_Mapping = Map("Category1" -> Array("A", "B"), "Category2" -> Array("C"), "Category3" -> Array("B", "D"))

def function(lst: mutable.WrappedArray[String], category_Mapping: Map[String, Array[String]]): Map[String, Int] = {
    // Start every category at 0 so that an empty list still yields one entry per category
    val map: scala.collection.mutable.Map[String, Int] = scala.collection.mutable.Map()
    category_Mapping.keys.foreach { key => map(key) = 0 }
    lst.foreach { l =>
        category_Mapping.keys.foreach { key =>
            if (category_Mapping(key).contains(l))
                map(key) = 1
        }
    }
    map.toMap
}

// The definition of the udf changes in this case: it takes category_Mapping and returns a udf that closes over it.

def output(category_Mapping: Map[String, Array[String]]) = udf { (lst: mutable.WrappedArray[String]) =>
    function(lst, category_Mapping)
}

df.withColumn("output", output(category_Mapping)(col("my_list"))).show(false)
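
As an alternative to a UDF, the same column could probably be assembled from Spark's built-in functions. This is only a sketch under assumptions, not the approach given above: it assumes Spark 2.4 or later (for `arrays_overlap`) and a local SparkSession, and `BuiltinFlags` and `flagColumns` are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{array, arrays_overlap, col, lit, map}

object BuiltinFlags {
  // Builds alternating key/value columns for `map`: the category name, followed by
  // 1 or 0 depending on whether `my_list` shares an element with that category's members.
  def flagColumns(mapping: Map[String, Array[String]]): Seq[Column] =
    mapping.toSeq.flatMap { case (name, members) =>
      val overlaps = arrays_overlap(col("my_list"), array(members.map(m => lit(m)): _*))
      Seq(lit(name), overlaps.cast("int"))
    }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, as in the answer above.
    val spark = SparkSession.builder().master("local[*]").appName("builtin-flags").getOrCreate()
    import spark.implicits._

    // A few rows taken from the question's sample data.
    val df = Seq((3, List("A", "B")), (4, List("C")), (7, List("B"))).toDF("Serial Number", "my_list")
    val category_Mapping = Map(
      "Category1" -> Array("A", "B"),
      "Category2" -> Array("C"),
      "Category3" -> Array("B", "D")
    )

    df.withColumn("output", map(flagColumns(category_Mapping): _*)).show(false)
    spark.stop()
  }
}
```

Keeping the logic in built-in expressions lets Spark optimize the plan and avoids converting each row to Scala objects. Note that `arrays_overlap` can return null rather than false when the arrays have no common element and one of them contains a null, so this sketch matches the UDF only for lists without nulls.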