Reading Nested json File in Pandas Dataframe
I have a JSON file with the following structure (this is not the complete json file, but the structure is the same):
{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
.....
//The rest of json continues with the same structure, but referenced_tweets is not always present
My question: how can I load this data into a data frame with the following columns: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id)?
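What I could do so far: I can get the following columns:
referenced_tweets | text | created_at | author_id | id (tweet id) |
---|---|---|---|---|
[{'type': 'xx', 'id': 'xxx'}] | xxx | xxxx | xxxxx | xxxxxxxxxxxx |
Here is the code to get the above table: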
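import json
from pandas import json_normalize

with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

# One row per tweet; referenced_tweets stays a list of dicts in a single column.
df1 = json_normalize(data_list, 'data')
df1.head()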
However, I would like to get type and id (inside referenced_tweets) in separate columns, and so far I can get the following:
type | id (referenced_tweet id) |
---|---|
xxxx | xxxxxxxxxxxxxxxxxxxxxxx |
Here is the code to get the above table:
df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()
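What is the problem? I would like to get everything in one table, i.e. a table similar to the first one here, but with type and id in separate columns (as in the second table). So the final columns should be: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).
Any help is appreciated.
Thanks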
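import json
import pandas as pd

with open('Test_SampleRetweets.json') as json_file:
    raw_data = json.load(json_file)

data = []
for item in raw_data["data"]:
    # Move the tweet's own id to "tweet_id" so that the referenced
    # tweet's "id" does not collide with it in the update below.
    item["tweet_id"] = item.pop("id")
    # referenced_tweets is not always present; fall back to an empty dict.
    referenced = item.pop("referenced_tweets", None) or [{}]
    # Only the first referenced tweet is used here.
    item.update(referenced[0])
    data.append(item)

df1 = pd.DataFrame(data)
print(df1.head())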
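The loop above keeps only the first entry of referenced_tweets. As a rough sketch of a variant, the code below emits one row per referenced tweet and still keeps tweets that have none. The inline sample data, the ids (t1, r1, ...) and the column names referenced_tweet_id and tweet_id are made up for illustration and are not part of the original file.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's file;
# the second tweet has no referenced_tweets key.
raw_data = {
    "data": [
        {
            "referenced_tweets": [{"type": "retweeted", "id": "r1"}],
            "text": "abcdefghijkl",
            "created_at": "2020-03-09T00:11:41.000Z",
            "author_id": "u1",
            "id": "t1",
        },
        {
            "text": "mnopqrstuvwx",
            "created_at": "2020-03-09T00:12:00.000Z",
            "author_id": "u2",
            "id": "t2",
        },
    ]
}

rows = []
for tweet in raw_data["data"]:
    # A missing or empty list becomes a single empty reference, so the
    # tweet still produces one row with empty type / referenced_tweet_id.
    refs = tweet.get("referenced_tweets") or [{}]
    for ref in refs:
        rows.append(
            {
                "type": ref.get("type"),
                "referenced_tweet_id": ref.get("id"),
                "text": tweet["text"],
                "created_at": tweet["created_at"],
                "author_id": tweet["author_id"],
                "tweet_id": tweet["id"],
            }
        )

df = pd.DataFrame(rows)
print(df)
```

Building each row explicitly also avoids the name clash between the two id fields, since each one gets its own column name.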
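When you work with nested json in json_normalize(), you need to use the meta parameter to pull in the fields from the level above the nested records. What you are doing right now is taking the nested list and normalizing it, without adding the other fields from the higher level. You can also combine this for several nested fields.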
import json
import pandas as pd

j = '{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}'
j = json.loads(j)

# since you have id twice, it's a bit more complicated and you need to
# introduce a meta prefix
df = pd.json_normalize(
    j,
    record_path=["data", 'referenced_tweets'],
    meta_prefix="data.",
    meta=[["data", "text"], ["data", "created_at"], ["data", "author_id"], ["data", "id"]]
)
print(df)
This results in:
        type            id data.data.text      data.data.created_at  \
0  retweeted       xxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z
1  retweeted  xxxxxxxxxxxx   abcdefghijkl  2020-03-09T00:11:41.000Z

  data.data.author_id data.data.id
0               xxxxx  xxxxxxxxxxx
1            xxxxxxxx  xxxxxxxxxxx
I would prefer this variant, since it looks easier to handle:
df = pd.json_normalize(
    j["data"],
    record_path=['referenced_tweets'],
    meta_prefix="data.",
    meta=["text", "created_at", "author_id", "id"]
)
print(df)
This results in:
        type            id     data.text           data.created_at  \
0  retweeted       xxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z
1  retweeted  xxxxxxxxxxxx  abcdefghijkl  2020-03-09T00:11:41.000Z

  data.author_id      data.id
0          xxxxx  xxxxxxxxxxx
1       xxxxxxxx  xxxxxxxxxxx
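The question mentions that referenced_tweets is not always present, and json_normalize() raises a KeyError when a record lacks the key named in record_path. One possible workaround, sketched below under assumptions, is to normalize only the tweets that have the key and then left-merge the result back onto the tweet-level table. The sample data, the ids (t1, r1, ...) and the "tweet." prefix are illustrative choices, not taken from the original file.

```python
import pandas as pd

# Hypothetical sample: the second tweet has no referenced_tweets key.
tweets = [
    {
        "referenced_tweets": [{"type": "retweeted", "id": "r1"}],
        "text": "abcdefghijkl",
        "created_at": "2020-03-09T00:11:41.000Z",
        "author_id": "u1",
        "id": "t1",
    },
    {
        "text": "mnopqrstuvwx",
        "created_at": "2020-03-09T00:12:00.000Z",
        "author_id": "u2",
        "id": "t2",
    },
]

# Normalize only the tweets that actually carry referenced_tweets.
with_refs = [t for t in tweets if t.get("referenced_tweets")]
refs = pd.json_normalize(
    with_refs,
    record_path="referenced_tweets",
    meta=["id"],
    meta_prefix="tweet.",  # yields a "tweet.id" column next to the record's "id"
)

# Tweet-level fields, one row per tweet, without the nested column.
base = pd.DataFrame(tweets).drop(columns="referenced_tweets", errors="ignore")
base = base.rename(columns={"id": "tweet.id"})

# A left merge keeps tweets without references; their type / id are NaN.
df = base.merge(refs, on="tweet.id", how="left")
print(df)
```

In the merged frame, id is the referenced tweet's id and tweet.id is the tweet's own id, so the two no longer share a column name.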