Converting JSON strings to a DataFrame in Spark in Python
(Apache Spark version 2.3.1 on Databricks)
Hi, I have a JSON dump that looks like this:
[{"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10342083, "venue_id": 273277, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18647, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 472158, "visitorteam_coach_id": 474616}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 18783, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 15251, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 14:00:00", "timezone": "UTC", "timestamp": 1530885600, "time": "14:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": null}, {"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10344350, "venue_id": 8869, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18743, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 474720, "visitorteam_coach_id": 474796}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 16781, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 18704, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 18:00:00", "timezone": "UTC", "timestamp": 1530900000, "time": "18:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": 
null}]
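I am trying to convert it to a DataFrame directly from a variable instead of a JSON file upload, mainly because I get the JSON data from a GET request to an API.
Here is my conversion code: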
countries = spark.read.option("multiline", "true").json(json.dumps(ts)).show(false)
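It gives me the error below. Please point me in the right direction. I looked around but only found Scala solutions; I am looking for the equivalent Python fix.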
IllegalArgumentException: u'java.net.URISyntaxException: Relative path in absolute URI:
"[{\"standings\":%20%7B%5C%22visitorteam_position%5C%22:%201,%20%5C%22localteam_position%5C%22:%201%7D,%20%5C%22season_id%5C%22:%20892,%20%5C%22pitch%5C%22:%20null,%20%5C%22commentaries%5C%22:%20null,%20%5C%22id%5C%22:%2010342083,%20%5C%22venue_id%5C%22:%20273277,%20%5C%22formations%5C%22:%20%7B%5C%22localteam_formation%5C%22:%20null,%20%5C%22visitorteam_formation%5C%22:%20null%7D,%20%5C%22aggregate_id%5C%22:%20null,%20%5C%22round_id%5C%22:%20null,%20%5C%22visitorteam_id%5C%22:%2018647,%20%5C%22winning_odds_calculated%5C%22:%20false,%20%5C%22deleted%5C%22:%20false,%20%5C%22coaches%5C%22:%20%7B%5C%22localteam_coach_id%5C%22:%20472158,%20%5C%22visitorteam_coach_id%5C%22:%20474616%7D,%20%5C%22attendance%5C%22:%20null,%20%5C%22scores%5C%22:%20%7B%5C%22ft_score%5C%22:%20null,%20%5C%22visitorteam_score%5C%22:%200,%20%5C%22et_score%5C%22:%20null,%20%5C%22localteam_pen_score%5C%22:%20null,%20%5C%22visitorteam_pen_score%5C%22:%20null,%20%5C%22localteam_score%5C%22:%200,%20%5C%22ht_score%5C%22:%20null%7D,%20%5C%22referee_id%5C%22:%2018783,%20%5C%22stage_id%5C%22:%201728,%20%5C%22weather_report%5C%22:%20null,%20%5C%22league_id%5C%22:%20732,%20%5C%22localteam_id%5C%22:%2015251,%20%5C%22time%5C%22:%20%7B%5C%22status%5C%22:%20%5C%22NS%5C%22,%20%5C%22starting_at%5C%22:%20%7B%5C%22date%5C%22:%20%5C%222018-07-06%5C%22,%20%5C%22date_time%5C%22:%20%5C%222018-07-06%2014:00:00%5C%22,%20%5C%22timezone%5C%22:%20%5C%22UTC%5C%22,%20%5C%22timestamp%5C%22:%201530885600,%20%5C%22time%5C%22:%20%5C%2214:00:00%5C%22%7D,%20%5C%22extra_minute%5C%22:%20null,%20%5C%22injury_time%5C%22:%20null,%20%5C%22second%5C%22:%20null,%20%5C%22added_time%5C%22:%20null,%20%5C%22minute%5C%22:%20null%7D,%20%5C%22group_id%5C%22:%20null%7D,%20%7B%5C%22standings%5C%22:%20%7B%5C%22visitorteam_position%5C%22:%201,%20%5C%22localteam_position%5C%22:%201%7D,%20%5C%22season_id%5C%22:%20892,%20%5C%22pitch%5C%22:%20null,%20%5C%22commentaries%5C%22:%20null,%20%5C%22id%5C%22:%2010344350,%20%5C%22venue_id%5C%22:%208869,%20%5C%22f
ormations%5C%22:%20%7B%5C%22localteam_formation%5C%22:%20null,%20%5C%22visitorteam_formation%5C%22:%20null%7D,%20%5C%22aggregate_id%5C%22:%20null,%20%5C%22round_id%5C%22:%20null,%20%5C%22visitorteam_id%5C%22:%2018743,%20%5C%22winning_odds_calculated%5C%22:%20false,%20%5C%22deleted%5C%22:%20false,%20%5C%22coaches%5C%22:%20%7B%5C%22localteam_coach_id%5C%22:%20474720,%20%5C%22visitorteam_coach_id%5C%22:%20474796%7D,%20%5C%22attendance%5C%22:%20null,%20%5C%22scores%5C%22:%20%7B%5C%22ft_score%5C%22:%20null,%20%5C%22visitorteam_score%5C%22:%200,%20%5C%22et_score%5C%22:%20null,%20%5C%22localteam_pen_score%5C%22:%20null,%20%5C%22visitorteam_pen_score%5C%22:%20null,%20%5C%22localteam_score%5C%22:%200,%20%5C%22ht_score%5C%22:%20null%7D,%20%5C%22referee_id%5C%22:%2016781,%20%5C%22stage_id%5C%22:%201728,%20%5C%22weather_report%5C%22:%20null,%20%5C%22league_id%5C%22:%20732,%20%5C%22localteam_id%5C%22:%2018704,%20%5C%22time%5C%22:%20%7B%5C%22status%5C%22:%20%5C%22NS%5C%22,%20%5C%22starting_at%5C%22:%20%7B%5C%22date%5C%22:%20%5C%222018-07-06%5C%22,%20%5C%22date_time%5C%22:%20%5C%222018-07-06%2018:00:00%5C%22,%20%5C%22timezone%5C%22:%20%5C%22UTC%5C%22,%20%5C%22timestamp%5C%22:%201530900000,%20%5C%22time%5C%22:%20%5C%2218:00:00%5C%22%7D,%20%5C%22extra_minute%5C%22:%20null,%20%5C%22injury_time%5C%22:%20null,%20%5C%22second%5C%22:%20null,%20%5C%22added_time%5C%22:%20null,%20%5C%22minute%5C%22:%20null%7D,%20%5C%22group_id%5C%22:%20null%7D%5D%22'
Output of print(ts):
Out[45]:
[{u'aggregate_id': None,
u'attendance': None,
u'coaches': {u'localteam_coach_id': 472158, u'visitorteam_coach_id': 474616},
u'commentaries': None,
u'deleted': False,
u'formations': {u'localteam_formation': None,
u'visitorteam_formation': None},
u'group_id': None,
u'id': 10342083,
u'league_id': 732,
u'localteam_id': 15251,
u'pitch': None,
u'referee_id': 18783,
u'round_id': None,
u'scores': {u'et_score': None,
u'ft_score': None,
u'ht_score': None,
u'localteam_pen_score': None,
u'localteam_score': 0,
u'visitorteam_pen_score': None,
u'visitorteam_score': 0},
u'season_id': 892,
u'stage_id': 1728,
u'standings': {u'localteam_position': 1, u'visitorteam_position': 1},
u'time': {u'added_time': None,
u'extra_minute': None,
u'injury_time': None,
u'minute': None,
u'second': None,
u'starting_at': {u'date': u'2018-07-06',
u'date_time': u'2018-07-06 14:00:00',
u'time': u'14:00:00',
u'timestamp': 1530885600,
u'timezone': u'UTC'},
u'status': u'NS'},
u'venue_id': 273277,
u'visitorteam_id': 18647,
u'weather_report': None,
u'winning_odds_calculated': False},
{u'aggregate_id': None,
u'attendance': None,
u'coaches': {u'localteam_coach_id': 474720, u'visitorteam_coach_id': 474796},
u'commentaries': None,
u'deleted': False,
u'formations': {u'localteam_formation': None,
u'visitorteam_formation': None},
u'group_id': None,
u'id': 10344350,
u'league_id': 732,
u'localteam_id': 18704,
u'pitch': None,
u'referee_id': 16781,
u'round_id': None,
u'scores': {u'et_score': None,
u'ft_score': None,
u'ht_score': None,
u'localteam_pen_score': None,
u'localteam_score': 0,
u'visitorteam_pen_score': None,
u'visitorteam_score': 0},
u'season_id': 892,
u'stage_id': 1728,
u'standings': {u'localteam_position': 1, u'visitorteam_position': 1},
u'time': {u'added_time': None,
u'extra_minute': None,
u'injury_time': None,
u'minute': None,
u'second': None,
u'starting_at': {u'date': u'2018-07-06',
u'date_time': u'2018-07-06 18:00:00',
u'time': u'18:00:00',
u'timestamp': 1530900000,
u'timezone': u'UTC'},
u'status': u'NS'},
u'venue_id': 8869,
u'visitorteam_id': 18743,
u'weather_report': None,
u'winning_odds_calculated': False}]
Output of print(json.dumps(ts)):
Out[44]: '[{"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10342083, "venue_id": 273277, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18647, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 472158, "visitorteam_coach_id": 474616}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 18783, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 15251, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 14:00:00", "timezone": "UTC", "timestamp": 1530885600, "time": "14:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": null}, {"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10344350, "venue_id": 8869, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18743, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 474720, "visitorteam_coach_id": 474796}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 16781, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 18704, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 18:00:00", "timezone": "UTC", "timestamp": 1530900000, "time": "18:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, 
"group_id": null}]'
Thanks in advance!
P.S. Here is a link showing how to do this in Scala: http://spark.apache.org/docs/2.2.0/sql-programming-guide.html#tab_scala_5
You said:
I am trying to convert it to a dataframe directly from a variable instead of a JSON file upload; mainly because I get the JSON data from a GET request to an API.
So I assume that ts is a string variable like this:
ts = """[{"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10342083, "venue_id": 273277, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18647, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 472158, "visitorteam_coach_id": 474616}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 18783, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 15251, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 14:00:00", "timezone": "UTC", "timestamp": 1530885600, "time": "14:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, "group_id": null}, {"standings": {"visitorteam_position": 1, "localteam_position": 1}, "season_id": 892, "pitch": null, "commentaries": null, "id": 10344350, "venue_id": 8869, "formations": {"localteam_formation": null, "visitorteam_formation": null}, "aggregate_id": null, "round_id": null, "visitorteam_id": 18743, "winning_odds_calculated": false, "deleted": false, "coaches": {"localteam_coach_id": 474720, "visitorteam_coach_id": 474796}, "attendance": null, "scores": {"ft_score": null, "visitorteam_score": 0, "et_score": null, "localteam_pen_score": null, "visitorteam_pen_score": null, "localteam_score": 0, "ht_score": null}, "referee_id": 16781, "stage_id": 1728, "weather_report": null, "league_id": 732, "localteam_id": 18704, "time": {"status": "NS", "starting_at": {"date": "2018-07-06", "date_time": "2018-07-06 18:00:00", "timezone": "UTC", "timestamp": 1530900000, "time": "18:00:00"}, "extra_minute": null, "injury_time": null, "second": null, "added_time": null, "minute": null}, 
"group_id": null}]"""
Now, json.dumps(ts) gives you a string, and .json(json.dumps(ts)) treats that string as a path, which is what the error message is telling you:
IllegalArgumentException: u'java.net.URISyntaxException: Relative path in absolute URI: "[{\"standings\":%20%7B%5C%22visitorteam_position%5C%22:%201,%20%5C%22localteam_position%5C%22:%201%7D,%20%5C%22season_id%5C%22:%20892,%20%5C
And the API documentation states the following:
.... :param path: string represents path to the JSON dataset, or a list of paths, or RDD of Strings storing JSON objects. .......
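So, if you want to use the variable ts, then, as the API documentation says, you have to put the JSON string into an RDD. Here ts is assumed to already be a JSON string; if ts is a parsed Python list, as the print(ts) output in the question suggests, pass json.dumps(ts) instead of ts: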
tsRDD = sc.parallelize([ts])
df = spark.read.option('multiline', "true").json(tsRDD)
This should give you the correct DataFrame:
+------------+----------+----------------+------------+-------+----------+--------+--------+---------+------------+-----+----------+--------+------------+---------+--------+---------+------------------------------------------------------------------------+--------+--------------+--------------+-----------------------+
|aggregate_id|attendance|coaches |commentaries|deleted|formations|group_id|id |league_id|localteam_id|pitch|referee_id|round_id|scores |season_id|stage_id|standings|time |venue_id|visitorteam_id|weather_report|winning_odds_calculated|
+------------+----------+----------------+------------+-------+----------+--------+--------+---------+------------+-----+----------+--------+------------+---------+--------+---------+------------------------------------------------------------------------+--------+--------------+--------------+-----------------------+
|null |null |[472158, 474616]|null |false |[,] |null |10342083|732 |15251 |null |18783 |null |[,,,, 0,, 0]|892 |1728 |[1, 1] |[,,,,, [2018-07-06, 2018-07-06 14:00:00, 14:00:00, 1530885600, UTC], NS]|273277 |18647 |null |false |
|null |null |[474720, 474796]|null |false |[,] |null |10344350|732 |18704 |null |16781 |null |[,,,, 0,, 0]|892 |1728 |[1, 1] |[,,,,, [2018-07-06, 2018-07-06 18:00:00, 18:00:00, 1530900000, UTC], NS]|8869 |18743 |null |false |
+------------+----------+----------------+------------+-------+----------+--------+--------+---------+------------+-----+----------+--------+------------+---------+--------+---------+------------------------------------------------------------------------+--------+--------------+--------------+-----------------------+
root
|-- aggregate_id: string (nullable = true)
|-- attendance: string (nullable = true)
|-- coaches: struct (nullable = true)
| |-- localteam_coach_id: long (nullable = true)
| |-- visitorteam_coach_id: long (nullable = true)
|-- commentaries: string (nullable = true)
|-- deleted: boolean (nullable = true)
|-- formations: struct (nullable = true)
| |-- localteam_formation: string (nullable = true)
| |-- visitorteam_formation: string (nullable = true)
|-- group_id: string (nullable = true)
|-- id: long (nullable = true)
|-- league_id: long (nullable = true)
|-- localteam_id: long (nullable = true)
|-- pitch: string (nullable = true)
|-- referee_id: long (nullable = true)
|-- round_id: string (nullable = true)
|-- scores: struct (nullable = true)
| |-- et_score: string (nullable = true)
| |-- ft_score: string (nullable = true)
| |-- ht_score: string (nullable = true)
| |-- localteam_pen_score: string (nullable = true)
| |-- localteam_score: long (nullable = true)
| |-- visitorteam_pen_score: string (nullable = true)
| |-- visitorteam_score: long (nullable = true)
|-- season_id: long (nullable = true)
|-- stage_id: long (nullable = true)
|-- standings: struct (nullable = true)
| |-- localteam_position: long (nullable = true)
| |-- visitorteam_position: long (nullable = true)
|-- time: struct (nullable = true)
| |-- added_time: string (nullable = true)
| |-- extra_minute: string (nullable = true)
| |-- injury_time: string (nullable = true)
| |-- minute: string (nullable = true)
| |-- second: string (nullable = true)
| |-- starting_at: struct (nullable = true)
| | |-- date: string (nullable = true)
| | |-- date_time: string (nullable = true)
| | |-- time: string (nullable = true)
| | |-- timestamp: long (nullable = true)
| | |-- timezone: string (nullable = true)
| |-- status: string (nullable = true)
|-- venue_id: long (nullable = true)
|-- visitorteam_id: long (nullable = true)
|-- weather_report: string (nullable = true)
|-- winning_odds_calculated: boolean (nullable = true)
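If ts is the already-parsed response (a list of dicts) rather than JSON text, one way to prepare it for the RDD route is to serialize each record to its own JSON string. The sketch below assumes Python 3; to_json_records and records_to_df are hypothetical helper names, and the ts sample is a shortened stand-in for the real API response. Because each string then holds one complete JSON object, the multiline option should not be needed on this path.

```python
import json


def to_json_records(payload):
    """Return a list of JSON strings, one string per record.

    `payload` may be JSON text, a single dict, or a list of dicts.
    """
    if isinstance(payload, str):
        payload = json.loads(payload)   # JSON text -> Python objects
    if isinstance(payload, dict):
        payload = [payload]             # single object -> one-element list
    return [json.dumps(record) for record in payload]


def records_to_df(spark, records):
    """Build a DataFrame from JSON strings (needs a live SparkSession)."""
    rdd = spark.sparkContext.parallelize(records)
    return spark.read.json(rdd)


# Hypothetical, shortened stand-in for the API response in the question.
ts = [
    {"id": 10342083, "venue_id": 273277, "scores": {"localteam_score": 0}},
    {"id": 10344350, "venue_id": 8869, "scores": {"localteam_score": 0}},
]
records = to_json_records(ts)
# On a cluster or notebook with a session available:
# df = records_to_df(spark, records)
```

The pure-Python helper can be checked without Spark; only records_to_df needs a running session.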
Alternatively, you can save the variable to a file and use
df = spark.read.option('multiline', "true").json(path_to_file)  # path_to_file: placeholder for the path of the saved JSON file
which works just as well as the suggestion above.
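A minimal sketch of the file route, assuming Python 3 and a location that Spark can read: save_json and read_json_file are hypothetical helper names, and the temporary-file location is only an illustration. On a cluster the file has to live on storage the executors can reach, so a local temp path like this one may only work in local mode.

```python
import json
import os
import tempfile


def save_json(payload, directory=None):
    """Write `payload` as one JSON document and return the file path."""
    fd, path = tempfile.mkstemp(suffix=".json", dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
    return path


def read_json_file(spark, path):
    """Read a whole-file JSON document (needs a live SparkSession)."""
    # The file holds a single JSON array, so multiline mode is required.
    return spark.read.option("multiline", "true").json(path)


# Hypothetical, shortened stand-in for the API response.
sample = [{"id": 10342083, "venue_id": 273277}, {"id": 10344350, "venue_id": 8869}]
saved_path = save_json(sample)
# With a session available:
# df = read_json_file(spark, saved_path)
```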
I hope the answer is helpful.