Extract elements from the array in pyspark dataframe
I have a dataframe like the one below:
+----------+------------------------------------------------------------------------------------+
|CustomerNo|recommendations |
+----------+------------------------------------------------------------------------------------+
|76 |[{1, 0.89554673}, {5, 0.85469896}, {2, 0.84503603}, {0, 0.80415034}, {6, 0.6815199}]|
|336 |[{1, 1.0019907}, {5, 0.9514036}, {4, 0.83544296}, {0, 0.76875824}, {7, 0.7413829}] |
|654 |[{5, 1.0243652}, {1, 0.9433953}, {6, 0.81832266}, {7, 0.69486576}, {8, 0.6834659}] |
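+----------+------------------------------------------------------------------------------------+
Schema: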
root
 |-- CustomerNo: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)
I need to extract the keys (the category values) from each list in every row. For example, the output table should look like this:
+----------+------------------------------------------------------------------------------------+
|CustomerNo|recommendations |
+----------+------------------------------------------------------------------------------------+
|76 |[1,5,2,0,6] |
|336 |[1,5,4,0,7] |
|654 |[5,1,6,7,8] |
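+----------+------------------------------------------------------------------------------------+
Can anyone tell me how to achieve this?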
Have you tried this? Given your dataframe:
+----------+------------------------------------------------------------------------------------+
|CustomerNo|recommendations |
+----------+------------------------------------------------------------------------------------+
|76 |[{1, 0.89554673}, {5, 0.85469896}, {2, 0.84503603}, {0, 0.80415034}, {6, 0.6815199}]|
|336 |[{1, 1.0019907}, {5, 0.9514036}, {4, 0.83544296}, {0, 0.76875824}, {7, 0.7413829}] |
|654 |[{5, 1.0243652}, {1, 0.9433953}, {6, 0.81832266}, {7, 0.69486576}, {8, 0.6834659}] |
+----------+------------------------------------------------------------------------------------+
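from pyspark.sql.functions import col

# Selecting a struct field through an array column returns an array of that field's values
df.select('CustomerNo', col('recommendations.category').alias("recommendations")).show()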
+----------+---------------+
|CustomerNo|recommendations|
+----------+---------------+
| 76|[1, 5, 2, 0, 6]|
| 336|[1, 5, 4, 0, 7]|
| 654|[5, 1, 6, 7, 8]|
+----------+---------------+
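To make it clearer why the dotted path works, here is a minimal plain-Python sketch of what selecting recommendations.category does for each row: it pulls the category field out of every struct in the array and keeps the order. This is only a conceptual model, not Spark code. The names rows and extract_categories are made up for illustration, and the data is copied from the table in the question.

```python
# Plain-Python model of the three rows from the question (no Spark needed).
# Each struct in the array is modelled as a dict with "category" and "rating".
rows = [
    {"CustomerNo": 76, "recommendations": [
        {"category": 1, "rating": 0.89554673}, {"category": 5, "rating": 0.85469896},
        {"category": 2, "rating": 0.84503603}, {"category": 0, "rating": 0.80415034},
        {"category": 6, "rating": 0.6815199}]},
    {"CustomerNo": 336, "recommendations": [
        {"category": 1, "rating": 1.0019907}, {"category": 5, "rating": 0.9514036},
        {"category": 4, "rating": 0.83544296}, {"category": 0, "rating": 0.76875824},
        {"category": 7, "rating": 0.7413829}]},
    {"CustomerNo": 654, "recommendations": [
        {"category": 5, "rating": 1.0243652}, {"category": 1, "rating": 0.9433953},
        {"category": 6, "rating": 0.81832266}, {"category": 7, "rating": 0.69486576},
        {"category": 8, "rating": 0.6834659}]},
]


def extract_categories(row):
    """Hypothetical helper: keep CustomerNo, replace each struct by its category."""
    return {
        "CustomerNo": row["CustomerNo"],
        "recommendations": [rec["category"] for rec in row["recommendations"]],
    }


result = [extract_categories(row) for row in rows]
for row in result:
    print(row["CustomerNo"], row["recommendations"])
```

In Spark itself the same per-element projection can, as far as I know, also be written with the transform higher-order function (available in pyspark.sql.functions from Spark 3.1), but the dotted column path above is the shorter option when you only need one field.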