How to process JSON data in a column using Python/PySpark?
I'm trying to process JSON data stored in a column on Databricks. Below is sample data from the table (records from a weather device):
JSON_Info |
---|
{"sampleData":"dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "} |
{"sampleData":"dataDetails: 2 002 2010/01/02 01:10:03 [device_info(1)] Weather=66F Wind Speed(mph)=13 UV Index=0_2 "} |
{"sampleData":"dataDetails: 3 003 2010/01/02 01:11:14 [device_info(1)] Weather=67F Wind Speed(mph)=14 UV Index=0_2 "} |
{"sampleData":"dataDetails: 4 004 2010/01/02 01:12:23 [device_info(1)] Weather=68F Wind Speed(mph)=15 UV Index=0_2 "} |
Every record has the single key "sampleData", and its value is one long string such as: "dataDetails: 1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 ".
Ideally, I'd like to extract the individual fields from that value (dataDetails) into separate columns, like this:
Index | SI | Date | Time | DeviceNumber | WeatherDegree | WindSpeed | UVIndex |
---|---|---|---|---|---|---|---|
1 | 001 | 2010/01/02 | 01:09:10 | [device_info(1)] | 65F | 12 | 0_2 |
2 | 002 | 2010/01/02 | 01:10:03 | [device_info(1)] | 66F | 13 | 0_2 |
3 | 003 | 2010/01/02 | 01:11:14 | [device_info(1)] | 67F | 14 | 0_2 |
4 | 004 | 2010/01/02 | 01:12:23 | [device_info(1)] | 68F | 15 | 0_2 |
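Here are some of my thoughts (though I'm not sure how to implement them):
Once I have a string like "1 001 2010/01/02 01:09:10 [device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2", split it on spaces to get most of the fields. Before splitting, the spaces inside "Wind Speed" and "UV Index" need to be removed so that they become "WindSpeed" and "UVIndex".
Then, for each token that contains an equals sign, take the part to the left of "=" as the column name.
In short: how can I use Python/PySpark to extract multiple fields from the value in a JSON data column?
Can anyone help?
Thanks a lot.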
Assuming your column already holds JSON (as a string), you can use the from_csv function with the option sep: ' ' to treat a single space as the delimiter:
from pyspark.sql import functions as F

(df
    # Parse the JSON string into a struct with a single field, sampleData
    .withColumn('JSON_Info', F.from_json('JSON_Info', 'sampleData string'))
    # Split the value on spaces; c1, c8 and c10 capture the throwaway
    # tokens "dataDetails:", "Wind" and "UV"
    .select(F.from_csv('JSON_Info.sampleData', 'c1 string, Index string, SI string, Date string, Time string, DeviceNumber string, WeatherDegree string, c8 string, Wind string, c10 string, UVIndex string', {'sep': ' '}).alias('csv'))
    .select('csv.*')
    .drop('c1', 'c8', 'c10')
    # Keep only the part after "=" in each key=value token
    .withColumn('WeatherDegree', F.split('WeatherDegree', '=')[1])
    .withColumn('Wind', F.split('Wind', '=')[1])
    .withColumn('UVIndex', F.split('UVIndex', '=')[1])
    .show()
)
+-----+---+----------+--------+----------------+-------------+----+-------+
|Index|SI |Date      |Time    |DeviceNumber    |WeatherDegree|Wind|UVIndex|
+-----+---+----------+--------+----------------+-------------+----+-------+
|1    |001|2010/01/02|01:09:10|[device_info(1)]|65F          |12  |0_2    |
|2    |002|2010/01/02|01:10:03|[device_info(1)]|66F          |13  |0_2    |
|3    |003|2010/01/02|01:11:14|[device_info(1)]|67F          |14  |0_2    |
|4    |004|2010/01/02|01:12:23|[device_info(1)]|68F          |15  |0_2    |
+-----+---+----------+--------+----------------+-------------+----+-------+
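For comparison, the idea from the question (split on spaces, then use the text to the left of "=" as the column name) can be sketched in plain Python with only the standard library. This is a minimal sketch rather than a definitive implementation. The function name parse_record and the POSITIONAL list are hypothetical, and it assumes every record has five positional fields followed by key=value pairs whose values contain no spaces. Because the names are derived from the keys, the temperature column comes out as "Weather" rather than "WeatherDegree".

```python
import json
import re

# Hypothetical names for the five positional fields, taken from the
# desired output table in the question.
POSITIONAL = ["Index", "SI", "Date", "Time", "DeviceNumber"]


def parse_record(raw_json: str) -> dict:
    """Turn one JSON_Info value into a flat dict of column name -> string."""
    text = json.loads(raw_json)["sampleData"]
    # Drop the leading "dataDetails:" label and surrounding whitespace.
    text = text.split(":", 1)[1].strip()

    # The first five whitespace-separated tokens are positional; whatever
    # is left over holds the key=value pairs.
    parts = text.split(None, len(POSITIONAL))
    row = dict(zip(POSITIONAL, parts))
    rest = parts[len(POSITIONAL)] if len(parts) > len(POSITIONAL) else ""

    # A key is everything up to "=", a value is the run of non-space
    # characters after it.
    for key, value in re.findall(r"([^=]+)=(\S+)", rest):
        key = re.sub(r"\(.*?\)", "", key)  # drop a unit such as "(mph)"
        key = key.replace(" ", "")         # "Wind Speed" -> "WindSpeed"
        row[key] = value
    return row


sample = (
    '{"sampleData":"dataDetails: 1 001 2010/01/02 01:09:10 '
    '[device_info(1)] Weather=65F Wind Speed(mph)=12 UV Index=0_2 "}'
)
print(parse_record(sample))
```

A function like this could be applied row by row after collecting a small table, or wrapped in a PySpark UDF. On large tables, built-in functions such as from_csv are generally the better choice, since they avoid per-row Python overhead.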