Using ijson to read JSON data from a specific key
I have several large JSON files that I am trying to load into a pandas dataframe. I've found that the typical way to handle large JSON in Python is with the ijson module. The JSONs represent geolocated tweet IDs, and I am only interested in tweet IDs from the US. The JSON data looks like this:
{
"tweet_id": "1223655173056356353",
"created_at": "Sat Feb 01 17:11:42 +0000 2020",
"user_id": "3352471150",
"geo_source": "user_location",
"user_location": {
"country_code": "br"
},
"geo": {},
"place": {
},
"tweet_locations": [
{
"country_code": "it",
"state": "Trentino-Alto",
"county": "Pustertal - Val Pusteria"
},
{
"country_code": "us"
},
{
"country_code": "ru",
"state": "Voronezh Oblast",
"county": "Petropavlovsky District"
},
{
"country_code": "at",
"state": "Upper Austria",
"county": "Braunau am Inn"
},
{
"country_code": "it",
"state": "Trentino-Alto",
"county": "Pustertal - Val Pusteria"
},
{
"country_code": "cn"
},
{
"country_code": "in",
"state": "Himachal Pradesh",
"county": "Jubbal"
}
]
}
How can I use ijson to select only the tweet IDs that are from the US, and then put those US IDs into a dataframe? The ijson module is new to me and I don't know how to approach this task. More specifically, I want to get all tweet IDs where the country code in user_location is US, or where a country code in tweet_locations is US. Any help is appreciated!
Use pandas.json_normalize
- Normalizes semi-structured JSON data into a flat table. data is your JSON dict.
- Pandas: Indexing and selecting data
- Data: Tweets with geo info (English) (pick 1)
- Each file contains many rows of dicts.
- They are not in a list or tuple, so read each row one at a time.
- The value of tweet_locations is a list of dicts.
- The value of user_location is a dict.
- For the case where tweet_locations is an empty list, [] instead of [{}], the row is not included in the dataframe, because of the way json_normalize expects the meta fields.
- The tweet_id from
{"tweet_id":"1256223765513584641","created_at":"Fri May 01 14:07:39 +0000 2020","user_id":"772487185031311360","geo_source":"user_location","user_location":{"country_code":"us"},"geo":{},"place":{},"tweet_locations":[]}
will not be included in the data.
- This can be resolved by setting "tweet_locations" = [{}] whenever "tweet_locations" is [].
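The workaround described above (replacing an empty tweet_locations with [{}] so json_normalize keeps the row) can be sketched as follows. This is a minimal example using a trimmed-down version of the sample row; the column names come from the json_normalize call used below:

```python
import json
import pandas as pd

# one NDJSON row whose tweet_locations is an empty list (trimmed for brevity)
lines = [
    '{"tweet_id": "1256223765513584641", "user_location": {"country_code": "us"}, "tweet_locations": []}',
]

data = [json.loads(line) for line in lines]

# replace empty tweet_locations with [{}] so json_normalize emits one row anyway
for record in data:
    if not record['tweet_locations']:
        record['tweet_locations'] = [{}]

df = pd.json_normalize(data, 'tweet_locations',
                       ['tweet_id', ['user_location', 'country_code']],
                       errors='ignore')
```

Without the patch this dataframe would be empty; with it, the row survives with NaN in the location columns and the tweet_id and user_location.country_code meta columns filled in.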
import pandas as pd
import json
from pathlib import Path
# path to file, which contains the sample data at the bottom of this answer
file = Path('data/test.json') # some path to your file
# load file
data = list()
with file.open('r') as f:
    for line in f:  # the file is rows of dicts that must be read 1 at a time
        data.append(json.loads(line))
# create dataframe
df = pd.json_normalize(data, 'tweet_locations', ['tweet_id', ['user_location', 'country_code']], errors='ignore')
# display(df.head())
   country_code              state         county    city             tweet_id user_location.country_code
0            us           Illinois  McLean County  Normal  1256223753220034566                        NaN
1            ke      Kiambu County            NaN     NaN  1256223748904161280                         ca
2            us           Illinois  McLean County  Normal  1256223744122593287                         us
3            th  Saraburi Province            NaN     NaN  1256223753463365632                        NaN
4            in              Assam          Lanka     NaN  1256223753463365632                        NaN
# filter for US in the two columns
us = df[(df.country_code == 'us') | (df['user_location.country_code'] == 'us')]
# display(us)
   country_code          state          county    city             tweet_id user_location.country_code
0            us       Illinois   McLean County  Normal  1256223753220034566                        NaN
2            us       Illinois   McLean County  Normal  1256223744122593287                         us
15           us       Michigan  Sanilac County     NaN  1256338355106672640                         in
16           us  West Virginia     Clay County     NaN  1256338355106672640                         in
18           us        Florida   Taylor County     NaN  1256338355106672640                         in
# get unique tweet_id from the filtered dataframe
us_tweet_ids = us.tweet_id.unique().tolist()
print(us_tweet_ids)
['1256223753220034566', '1256223744122593287', '1256338355106672640']
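Since the original goal was to put the US tweet IDs into a dataframe, the unique IDs can be wrapped back into a one-column dataframe (a minimal sketch; the name us_ids is chosen here):

```python
import pandas as pd

# unique US tweet IDs, as produced by the filtering step (values copied from the output above)
tweet_ids = ['1256223753220034566', '1256223744122593287', '1256338355106672640']

# wrap the IDs in a one-column dataframe
us_ids = pd.DataFrame({'tweet_id': tweet_ids})
```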
Load and parse all the JSON files
- Only one file is ever fully loaded at a time.
- Use pandas.concat to combine the list of dataframes, us_data.
# path to files
p = Path('c:/path_to_files')
# get all json files
files = list(p.rglob('*.json'))
# parse files
us_data = list()
for file in files:
    data = list()
    with file.open('r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))
    # create dataframe
    df = pd.json_normalize(data, 'tweet_locations', ['tweet_id', ['user_location', 'country_code']], errors='ignore')
    # filter for US in the two columns
    df = df[(df.country_code == 'us') | (df['user_location.country_code'] == 'us')]
    us_data.append(df)
# combine all data into one dataframe
us = pd.concat(us_data)
# delete objects that are no longer needed
del data
del df
del us_data
Parse only the tweet_id, without pandas
- Because the files are rows of dicts, ijson is not needed.
- As written, this will include a tweet_id if the country_code in user_location is 'us', even when tweet_locations is an empty list.
- The tweet_id from
{"tweet_id":"1256223765513584641","created_at":"Fri May 01 14:07:39 +0000 2020","user_id":"772487185031311360","geo_source":"user_location","user_location":{"country_code":"us"},"geo":{},"place":{},"tweet_locations":[]}
will be included in the data.
file = Path('data/en_geo_2020-05-01/en_geo_2020-05-01.json')
tweet_ids = list()
with file.open('r') as f:
    for line in f:
        line = json.loads(line)
        if line.get('user_location', {}).get('country_code') == 'us':
            tweet_ids.append(line.get('tweet_id'))
        elif line['tweet_locations']:  # if tweet_locations is a non-empty list
            tweet_locations_country_code = [i.get('country_code') for i in line['tweet_locations']]  # get the country_code of each location
            if 'us' in tweet_locations_country_code:  # if 'us' is in the list
                tweet_ids.append(line.get('tweet_id'))  # append
print(tweet_ids)
['1256223753220034566', '1256223744122593287', '1256338355106672640']
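The same pure-Python filter can be wrapped in a function and applied over all files, mirroring the multi-file loop from the pandas section. This is a sketch; the function name us_ids_in_file and the directory path are placeholders chosen here:

```python
import json
from pathlib import Path

def us_ids_in_file(path):
    """Return the tweet_ids in one file of JSON rows that are geolocated to the US."""
    ids = []
    with Path(path).open('r', encoding='utf-8') as f:
        for line in f:
            record = json.loads(line)
            if record.get('user_location', {}).get('country_code') == 'us':
                ids.append(record.get('tweet_id'))
            elif any(loc.get('country_code') == 'us'
                     for loc in record.get('tweet_locations', [])):
                ids.append(record.get('tweet_id'))
    return ids

# apply to every json file under a directory:
# us_tweet_ids = [tid for p in Path('c:/path_to_files').rglob('*.json')
#                 for tid in us_ids_in_file(p)]
```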
Sample data
- The data is rows of dicts in a file.
{"tweet_id":"1256223753220034566","created_at":"Fri May 01 14:07:36 +0000 2020","user_id":"916540973190078465","geo_source":"tweet_text","user_location":{},"geo":{},"place":{},"tweet_locations":[{"country_code":"us","state":"Illinois","county":"McLean County","city":"Normal"}]}
{"tweet_id":"1256223748904161280","created_at":"Fri May 01 14:07:35 +0000 2020","user_id":"697426379583983616","geo_source":"user_location","user_location":{"country_code":"ca"},"geo":{},"place":{},"tweet_locations":[{"country_code":"ke","state":"Kiambu County"}]}
{"tweet_id":"1256223744122593287","created_at":"Fri May 01 14:07:34 +0000 2020","user_id":"1277481013","geo_source":"user_location","user_location":{"country_code":"us","state":"Florida"},"geo":{},"place":{},"tweet_locations":[{"country_code":"us","state":"Illinois","county":"McLean County","city":"Normal"}]}
{"tweet_id":"1256223753463365632","created_at":"Fri May 01 14:07:36 +0000 2020","user_id":"596005899","geo_source":"tweet_text","user_location":{},"geo":{},"place":{},"tweet_locations":[{"country_code":"th","state":"Saraburi Province"},{"country_code":"in","state":"Assam","county":"Lanka"},{"country_code":"cz","state":"Northeast","county":"okres \u00dast\u00ed nad Orlic\u00ed"},{"country_code":"lk"}]}
{"tweet_id":"1256223753115238406","created_at":"Fri May 01 14:07:36 +0000 2020","user_id":"139159502","geo_source":"user_location","user_location":{"country_code":"ca"},"geo":{},"place":{},"tweet_locations":[{"country_code":"ve"},{"country_code":"ca","state":"Nova Scotia","county":"Pictou County","city":"Diamond"},{"country_code":"my","state":"Selangor","city":"Kajang"}]}
{"tweet_id":"1256223748161757190","created_at":"Fri May 01 14:07:35 +0000 2020","user_id":"1655021437","geo_source":"user_location","user_location":{"country_code":"af","state":"Nangarhar","county":"Kot"},"geo":{},"place":{},"tweet_locations":[{"country_code":"cz","state":"Northeast","county":"okres \u00dast\u00ed nad Orlic\u00ed"},{"country_code":"cz","state":"Northeast","county":"okres \u00dast\u00ed nad Orlic\u00ed"},{"country_code":"gb","state":"England","county":"Gloucestershire"}]}
{"tweet_id":"1256223749214437380","created_at":"Fri May 01 14:07:35 +0000 2020","user_id":"3244990814","geo_source":"user_location","user_location":{"country_code":"se"},"geo":{},"place":{},"tweet_locations":[{"country_code":"cg","state":"Kouilou","county":"Pointe-Noire"},{"country_code":"cn"}]}
{"tweet_id":"1256338355106672640","created_at":"Fri May 01 21:43:00 +0000 2020","user_id":"1205700416123486208","geo_source":"user_location","user_location":{"country_code":"in","state":"Delhi"},"geo":{},"place":{},"tweet_locations":[{"country_code":"us","state":"Michigan","county":"Sanilac County"},{"country_code":"us","state":"West Virginia","county":"Clay County"},{"country_code":"de","state":"Baden-W\u00fcrttemberg","county":"Verwaltungsgemeinschaft Friedrichshafen"},{"country_code":"us","state":"Florida","county":"Taylor County"}]}
{"tweet_id":"1256223764980944904","created_at":"Fri May 01 14:07:39 +0000 2020","user_id":"1124447266205503488","geo_source":"none","user_location":{},"geo":{},"place":{},"tweet_locations":[]}
{"tweet_id":"1256223760765595650","created_at":"Fri May 01 14:07:38 +0000 2020","user_id":"909477905737990144","geo_source":"tweet_text","user_location":{},"geo":{},"place":{},"tweet_locations":[{"country_code":"lr","state":"Grand Bassa County","county":"District # 2"}]}