How to process JSON data in a column using Python/PySpark?

Question

I am trying to process JSON data stored in a column on Databricks. Below is sample data from the table (records from a weather device):

JSON_Info
{"sampleData":"dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "}
{"sampleData":"dataDetails: 2 002 2010/01/02 01:10:03 [device_info(1)] Weather=66F Wind Speed(mph)=13 UV Index=0_2 "}
{"sampleData":"dataDetails: 3 003 2010/01/02 01:11:14 [device_info(1)] Weather=67F Wind Speed(mph)=14 UV Index=0_2 "}
{"sampleData":"dataDetails: 4 004 2010/01/02 01:12:23 [device_info(1)] Weather=68F Wind Speed(mph)=15 UV Index=0_2 "}

Every record has "sampleData" as its key, and the value is one long string such as: "dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 ".

Ideally, I would like to split the value (dataDetails) into multiple columns, like this:

Index	SI	Date	Time	DeviceNumber	WeatherDegree	WindSpeed	UVIndex
1	001	2010/01/02	01:09:10	[device_info(1)]	65F	12	0_2
2	002	2010/01/02	01:10:03	[device_info(1)]	66F	13	0_2
3	003	2010/01/02	01:11:14	[device_info(1)]	67F	14	0_2
4	004	2010/01/02	01:12:23	[device_info(1)]	68F	15	0_2

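Here are some of my ideas (I am not sure how to implement them):

Once I have a string like "1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2", split it on spaces to get most of the fields (before splitting, the spaces inside "Wind Speed" and "UV Index" need to be removed so that they become "WindSpeed" and "UVIndex").
Then, for any token that contains "=", use the part to the left of "=" as the column name.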
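
The idea above can be sketched in plain Python for a single record. This is only a minimal illustration, not a Spark solution: the function name `parse_record` and the list of positional column names are assumptions made for this example, and it assumes every record follows exactly the layout of the sample data.

```python
import json

def parse_record(raw_json):
    """Turn one JSON_Info value into a dict of column name -> value (hypothetical helper)."""
    details = json.loads(raw_json)["sampleData"]
    # Drop the leading "dataDetails:" label (split on the first colon only).
    details = details.split(":", 1)[1].strip()
    # Join the multi-word keys so that each key=value pair becomes a single token.
    details = details.replace("Wind Speed(mph)", "WindSpeed").replace("UV Index", "UVIndex")
    tokens = details.split()
    # Assumption: the first five tokens are positional and carry no "=" sign.
    positional = ["Index", "SI", "Date", "Time", "DeviceNumber"]
    row = dict(zip(positional, tokens[:5]))
    # For the remaining tokens, the text left of "=" becomes the column name.
    for token in tokens[5:]:
        key, _, value = token.partition("=")
        row[key] = value
    return row

sample = '{"sampleData":"dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "}'
row = parse_record(sample)
```

Note that taking the name from the left of "=" yields a column called "Weather" rather than "WeatherDegree"; it would need to be renamed afterwards to match the desired table.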

In short: how can I use Python/PySpark to extract multiple pieces of information from the value in a JSON data column?

Can anyone help?

Thanks a lot

Answer 1

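Assuming your column already holds JSON, you can extract sampleData with from_json and then use the from_csv function (available since Spark 3.0) with the option sep: ' ' so that a space is used as the delimiter.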

from pyspark.sql import functions as F

(df
    # Parse the JSON string into a struct with one string field, sampleData
    .withColumn('JSON_Info', F.from_json('JSON_Info', 'sampleData string'))
    # Split sampleData on spaces; c1, c8 and c10 hold the throwaway tokens "dataDetails:", "Wind" and "UV"
    .select(F.from_csv('JSON_Info.sampleData', 'c1 string, Index string, SI string, Date string, Time string, DeviceNumber string, WeatherDegree string, c8 string, Wind string, c10 string, UVIndex string', {'sep': ' '}).alias('csv'))
    .select('csv.*')
    .drop('c1', 'c8', 'c10')
    # Keep only the value to the right of '=' (e.g. "Weather=65F" becomes "65F")
    .withColumn('WeatherDegree', F.split('WeatherDegree', '=')[1])
    .withColumn('Wind', F.split('Wind', '=')[1])
    .withColumn('UVIndex', F.split('UVIndex', '=')[1])
    .show()
)

+-----+---+----------+--------+----------------+-------------+----+-------+
|Index|SI |Date      |Time    |DeviceNumber    |WeatherDegree|Wind|UVIndex|
+-----+---+----------+--------+----------------+-------------+----+-------+
|1    |001|2010/01/02|01:09:10|[device_info(1)]|65F          |12  |0_2    |
|2    |002|2010/01/02|01:10:03|[device_info(1)]|66F          |13  |0_2    |
|3    |003|2010/01/02|01:11:14|[device_info(1)]|67F          |14  |0_2    |
|4    |004|2010/01/02|01:12:23|[device_info(1)]|68F          |15  |0_2    |
+-----+---+----------+--------+----------------+-------------+----+-------+
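
As a possible alternative to from_csv, the same columns could be pulled out with a single regular expression via get_json_object and regexp_extract. The sketch below is not part of the original answer: the names `DETAILS_PATTERN`, `COLUMN_NAMES` and `extract_columns` are hypothetical, and the pattern assumes every record matches the sample layout exactly. PySpark is imported inside the function only so that the pattern itself can be checked with Python's re module without a Spark session.

```python
import re

# Hypothetical pattern: one capture group per target column, in order.
DETAILS_PATTERN = (
    r"^dataDetails:\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)"
    r"\s+Weather=(\S+)\s+Wind Speed\(mph\)=(\S+)\s+UV Index=(\S+)\s*$"
)
COLUMN_NAMES = ["Index", "SI", "Date", "Time", "DeviceNumber",
                "WeatherDegree", "WindSpeed", "UVIndex"]

def extract_columns(df, json_col="JSON_Info"):
    """Return a DataFrame with one column per capture group (assumes a string JSON column)."""
    from pyspark.sql import functions as F  # imported lazily; requires PySpark
    # Read the sampleData field out of the JSON string.
    details = F.get_json_object(F.col(json_col), "$.sampleData")
    # regexp_extract returns an empty string for rows that do not match the pattern.
    return df.select(*[
        F.regexp_extract(details, DETAILS_PATTERN, i).alias(name)
        for i, name in enumerate(COLUMN_NAMES, start=1)
    ])

# Check the pattern against one sample value using plain Python.
sample = "dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "
match = re.match(DETAILS_PATTERN, sample)
sample_row = dict(zip(COLUMN_NAMES, match.groups()))
```

With a DataFrame like the one in the question, `extract_columns(df).show()` would be the intended usage. Compared with from_csv, this approach names WindSpeed directly and does not need the throwaway columns, at the cost of a stricter pattern that silently yields empty strings when a record deviates from the expected layout.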

apache-spark-sql

pyspark

databricks

azure-databricks