将深度嵌套 JSON 扁平化为 pandas 数据框

Question

我正在尝试将深度嵌套的 JSON 导入 pandas 数据框。这是 JSON 文件的结构（这只是第一条记录 (retweets[:1])）：

[{'lang': 'en',
  'author_id': '1076979440372965377',
  'reply_settings': 'everyone',
  'entities': {'mentions': [{'start': 3,
     'end': 17,
     'username': 'Terry81987010',
     'url': '',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    {'start': 19,
     'end': 32,
     'username': 'DineshDSouza',
     'location': 'United States',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
      'description': {'urls': [{'start': 80,
         'end': 103,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
     'created_at': '2009-11-22T22:32:41.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 1748832,
      'following_count': 5355,
      'tweet_count': 65674,
      'listed_count': 6966},
     'name': "Dinesh D'Souza",
     'verified': True,
     'pinned_tweet_id': '1393309917239562241',
     'id': '91882544',
     'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
     'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}]},
  'conversation_id': '1253462541881106433',
  'created_at': '2020-04-23T23:15:32.000Z',
  'id': '1253462541881106433',
  'possibly_sensitive': False,
  'referenced_tweets': [{'type': 'retweeted',
    'id': '1253052684489437184',
    'in_reply_to_user_id': '91882544',
    'attachments': {'media_keys': ['3_1253052312144293888',
      '3_1253052620937277442'],
     'media': [{}, {}]},
    'entities': {'annotations': [{'start': 126,
       'end': 128,
       'probability': 0.514,
       'type': 'Organization',
       'normalized_text': 'CDC'},
      {'start': 145,
       'end': 146,
       'probability': 0.5139,
       'type': 'Place',
       'normalized_text': 'NY'}],
     'mentions': [{'start': 0,
       'end': 13,
       'username': 'DineshDSouza',
       'location': 'United States',
       'entities': {'url': {'urls': [{'start': 0,
           'end': 23,
           'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
           'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
        'description': {'urls': [{'start': 80,
           'end': 103,
           'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
           'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
       'created_at': '2009-11-22T22:32:41.000Z',
       'protected': False,
       'public_metrics': {'followers_count': 1748832,
        'following_count': 5355,
        'tweet_count': 65674,
        'listed_count': 6966},
       'name': "Dinesh D'Souza",
       'verified': True,
       'pinned_tweet_id': '1393309917239562241',
       'id': '91882544',
       'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe:  ",
       'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}],
     'urls': [{'start': 187,
       'end': 210,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
       'display_url': 'pic.twitter.com/H4NpN5ZMkW'},
      {'start': 187,
       'end': 210,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
       'display_url': 'pic.twitter.com/H4NpN5ZMkW'}]},
    'lang': 'en',
    'author_id': '1091471553437593605',
    'reply_settings': 'everyone',
    'conversation_id': '1253050942716551168',
    'created_at': '2020-04-22T20:06:55.000Z',
    'possibly_sensitive': False,
    'referenced_tweets': [{'type': 'replied_to', 'id': '1253050942716551168'}],
    'public_metrics': {'retweet_count': 208,
     'reply_count': 57,
     'like_count': 402,
     'quote_count': 38},
    'source': 'Twitter Web App',
    'text': "@DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according to the CDC. And of course NY getting paid for each cv death ,000",
    'context_annotations': [{'domain': {'id': '10',
       'name': 'Person',
       'description': 'Named people in the world like Nelson Mandela'},
      'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
     {'domain': {'id': '35',
       'name': 'Politician',
       'description': 'Politicians in the world, like Joe Biden'},
      'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
    'author': {'url': '',
     'username': 'Terry81987010',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    'in_reply_to_user': {'username': 'DineshDSouza',
     'location': 'United States',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
      'description': {'urls': [{'start': 80,
         'end': 103,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
     'created_at': '2009-11-22T22:32:41.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 1748832,
      'following_count': 5355,
      'tweet_count': 65674,
      'listed_count': 6966},
     'name': "Dinesh D'Souza",
     'verified': True,
     'pinned_tweet_id': '1393309917239562241',
     'id': '91882544',
     'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe:  ",
     'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}}],
  'public_metrics': {'retweet_count': 208,
   'reply_count': 0,
   'like_count': 0,
   'quote_count': 0},
  'source': 'Twitter for iPhone',
  'text': "RT @Terry81987010: @DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according t…",
  'context_annotations': [{'domain': {'id': '10',
     'name': 'Person',
     'description': 'Named people in the world like Nelson Mandela'},
    'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
   {'domain': {'id': '35',
     'name': 'Politician',
     'description': 'Politicians in the world, like Joe Biden'},
    'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
  'author': {'url': '',
   'username': 'set1952',
   'location': 'Etats-Unis',
   'created_at': '2018-12-23T23:14:42.000Z',
   'protected': False,
   'public_metrics': {'followers_count': 103,
    'following_count': 44,
    'tweet_count': 44803,
    'listed_count': 0},
   'name': 'SunSet1952',
   'verified': False,
   'id': '1076979440372965377',
   'description': '',
   'profile_image_url': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'},
  '__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
   'version': '2.0.8',
   'retrieved_at': '2021-05-17T17:13:17+00:00'}},

这是我的代码：

retweets = []
for line in open('Data/usersRetweetsFlatten_sample.json', 'r'):
    retweets.append(json.loads(line))

df = json_normalize(
    retweets, 'referenced_tweets', ['referenced_tweets', 'type'],
    meta_prefix= ".",
    errors='ignore'
)
df[['author_id',  'type', '.type', 'id', 'in_reply_to_user_id', 'referenced_tweets']].head()

这是生成的数据帧：

如您所见，referenced_tweets 列尚未展平（请注意，在我的 JSON 文件中有两个不同的 referenced_tweets 数组：一个在更深层次在另一个“referenced_tweets”内。比如上一级return这个：

 >>> retweets[0]["referenced_tweets"][0]["type"]
"retweeted"

和更深一层return这个：

>>> retweets[0]["referenced_tweets"][0]["referenced_tweets"][0]["type"]
'replied_to'

问题： 我想知道如何才能将更深的部分弄平 referenced_tweets。我想有两个单独的列作为 referenced_tweets.type 和 referenced_tweets.id，其中上例中列 referenced_tweets.type 的值应该是 replied_to.

Answer 1

我认为这里的问题是您的数据是双重嵌套的...referenced_tweets 中有一个键 referenced_tweets。

import json
from pandas import json_normalize

with open("flatten.json", "r") as file:
    data = json.load(file)

df = json_normalize(
    data, 
    record_path=["referenced_tweets", "referenced_tweets"],
    meta=[
        "author_id",
        
        # ["author", "username"], # not possible
        # "author", # possible but not useful
        
        ["referenced_tweets", "id"],
        ["referenced_tweets", "type"],
        ["referenced_tweets", "in_reply_to_user_id"],
        ["referenced_tweets", "in_reply_to_user", "username"], 
    ]
)

print(df)

另请参阅：

注意： 如果缺少第二个嵌套 referenced_tweet，以上代码将失败。

编辑： 或者，您可以通过额外的手动迭代进一步规范化问题中的数据（您已经使用代码部分规范化了数据）。请参见下面的示例。注意：代码未优化，可能会根据数据量变慢。

# load your `data` with `json.load()` or `json.loads()`

df = json_normalize(
    data, 
    record_path="referenced_tweets",
    meta=["referenced_tweets", "type"],
    meta_prefix= ".",
    errors="ignore",
)

columns = [*df.columns, "_type", "_id"]
normalized_data = []

def append(row, type, id):
    normalized_row = [*row.to_list(), type, id]
    normalized_data.append(normalized_row)

for _, row in df.iterrows():
    
    # a list/array is expected
    if type(row["referenced_tweets"]) is list:
        for tweet in row["referenced_tweets"]:
            append(row, tweet["type"], tweet["id"])
        # if empty list
        else:
            append(row, None, None)
    else:
        append(row, None, None)

enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df.drop("referenced_tweets", 1)

print(enhanced_df)

编辑2： referenced_tweets应该是一个数组。但是，如果没有引用推文，Twitter API 似乎完全省略了 referenced_tweets。在这种情况下，单元格值为 NaN（浮点数）而不是空列表。我更新了上面的代码以考虑到这一点。

将深度嵌套 JSON 扁平化为 pandas 数据框

Flattening deeply nested JSON into pandas data frame

python

nested

pandas

json-flattener

json-normalize