Extract the expanded URL from the entities in a Twitter user JSON

Question

How can I get the expanded_url from entities?

{"blocked_by": false,
"blocking": false,
"contributors_enabled": false,
"created_at": "Mon Dec 27 16:09:18 +0000 2010",
"default_profile": false,
"default_profile_image": false,
"description": "Wealth Management. Banker. Former elite cyclist. 100% galego. Yield hunter. Until debt tear us apart | USC Alumni",
"entities": {
  "description": {
    "urls": []
  },
  "url": {
    "urls": [
      {
        "display_url": "fanecabrava.substack.com/?utm_source=ac\u2026",
        "expanded_url": "https://fanecabrava.substack.com/?utm_source=account-card&utm_content=writes",
        "indices": [
          0,
          23
        ],
        "url": "shorteners urls i deleted it to post this"
      }
    ]
  }
},
"favourites_count": 3808,
"follow_request_sent": false,
"followers_count": 578,
"following": false,
"friends_count": 465,
"geo_enabled": true,
"has_extended_profile": false,
"id": 231102009,
"id_str": "231102009",
"is_translation_enabled": false,
"is_translator": false,
"lang": null,

}

The full JSON file is available at this link:

https://drive.google.com/file/d/1zB1mmU5zHbJC6R7ZZESReB0bWGQVTvfx/view?usp=drivesdk

I only want the expanded URL, and I want to store it in a CSV file.

Answer 1

You can combine read_json() and apply().

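import pandas as pd
import requests

data = pd.read_json('json.json')
print(data)

# Alternative 1: take expanded_url from the description section of entities
urls = data['entities'].apply(
    lambda x: x['description']['urls'][0]['expanded_url']
    if len(x['description']['urls']) > 0 else pd.NA
)
urls = urls[urls.notna()]

# Alternative 2: take the shortened profile URL from the top-level url column
# (this overwrites the result of alternative 1; use one or the other)
urls = data['url']

# Expand a shortened URL by reading the Location header of its redirect
def expand_url(url):
    # Users without a profile URL have a null value here
    if pd.isna(url):
        return ''

    # Do not follow the redirect; only read where it points
    r = requests.get(url, allow_redirects=False, timeout=10)
    return r.headers.get('location', '')

expanded_url = urls.apply(expand_url)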

Output:

0                                                       
1                          https://www.sallyturbitt.com/
2                         https://johannesdrooghaag.com/
3                                                       
4
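
Since the sample in the question already carries expanded_url under entities.url.urls, it may be possible to skip the HTTP requests entirely and read the value straight from the JSON. The following is a minimal sketch of that idea using only the standard library. It assumes each user is a dict shaped like the sample above; the helper names extract_expanded_urls and write_expanded_urls, the two-column CSV layout, and the second test record are hypothetical and not part of the original answer.

```python
import csv
import io


def extract_expanded_urls(user):
    """Return every expanded_url found under entities.url.urls for one user dict."""
    entities = user.get("entities") or {}
    url_section = entities.get("url") or {}  # missing when the user has no profile URL
    return [
        item["expanded_url"]
        for item in url_section.get("urls", [])
        if item.get("expanded_url")
    ]


def write_expanded_urls(users, handle):
    """Write one CSV row (id_str, expanded_url) per expanded URL found."""
    writer = csv.writer(handle)
    writer.writerow(["id_str", "expanded_url"])
    for user in users:
        for url in extract_expanded_urls(user):
            writer.writerow([user.get("id_str", ""), url])


# Trimmed-down records shaped like the sample in the question.
users = [
    {
        "id_str": "231102009",
        "entities": {
            "description": {"urls": []},
            "url": {
                "urls": [
                    {
                        "expanded_url": "https://fanecabrava.substack.com/?utm_source=account-card&utm_content=writes"
                    }
                ]
            },
        },
    },
    # Hypothetical user with no profile URL, to show the missing-key case.
    {"id_str": "0", "entities": {"description": {"urls": []}}},
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_expanded_urls(users, buffer)
print(buffer.getvalue())
```

To write a real file, the buffer could be replaced with something like open("expanded_urls.csv", "w", newline="", encoding="utf-8") (the file name is only an example), and users could be loaded with json.load, assuming the JSON file holds a list of user objects.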


python

json

tweepy

pandas