Extract the expanded URL from the entities in Twitter user JSON
How can I get expanded_url from entities?
{
  "blocked_by": false,
  "blocking": false,
  "contributors_enabled": false,
  "created_at": "Mon Dec 27 16:09:18 +0000 2010",
  "default_profile": false,
  "default_profile_image": false,
  "description": "Wealth Management. Banker. Former elite cyclist. 100% galego. Yield hunter. Until debt tear us apart | USC Alumni",
  "entities": {
    "description": {
      "urls": []
    },
    "url": {
      "urls": [
        {
          "display_url": "fanecabrava.substack.com/?utm_source=ac\u2026",
          "expanded_url": "https://fanecabrava.substack.com/?utm_source=account-card&utm_content=writes",
          "indices": [
            0,
            23
          ],
          "url": "shorteners urls i deleted it to post this"
        }
      ]
    }
  },
  "favourites_count": 3808,
  "follow_request_sent": false,
  "followers_count": 578,
  "following": false,
  "friends_count": 465,
  "geo_enabled": true,
  "has_extended_profile": false,
  "id": 231102009,
  "id_str": "231102009",
  "is_translation_enabled": false,
  "is_translator": false,
  "lang": null,
}
The complete JSON file is at this link:
https://drive.google.com/file/d/1zB1mmU5zHbJC6R7ZZESReB0bWGQVTvfx/view?usp=drivesdk
I only want the expanded URLs, stored in a CSV file.
You can use read_json() together with apply().
import pandas as pd
import requests

# load the file; each row is one user object
data = pd.read_json('json.json')
print(data)

# alternative 1: expanded_url from the description section of entities
urls = data['entities'].apply(
    lambda x: x['description']['urls'][0]['expanded_url']
    if len(x['description']['urls']) > 0 else pd.NA)
urls = urls[urls.notna()]

# alternative 2: shortened URL from the top-level url field
# (this overwrites the result of alternative 1; keep only the one you need)
urls = data['url']

# expand urls by reading the redirect target of the shortened link
def expand_url(url):
    # missing values may arrive as None or NaN
    if pd.isna(url):
        return ''
    r = requests.get(url, allow_redirects=False)
    try:
        return r.headers['location']
    except KeyError:
        return ''

expanded_url = urls.apply(expand_url)
Output:
0
1 https://www.sallyturbitt.com/
2 https://johannesdrooghaag.com/
3
4
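If the goal is only to read the expanded_url values already present under entities -> url -> urls and save them to a CSV file, no HTTP request is needed. The following is a minimal sketch using only the standard library. It assumes the full file is a JSON list of user objects shaped like the excerpt in the question. The helper names extract_expanded_urls and write_expanded_urls, the CSV column names, and the second sample user are hypothetical and chosen for illustration.

```python
import csv
import json


def extract_expanded_urls(user):
    """Return every expanded_url found under entities -> url -> urls for one user dict."""
    entities = user.get("entities") or {}
    url_items = (entities.get("url") or {}).get("urls") or []
    return [item["expanded_url"] for item in url_items if item.get("expanded_url")]


def write_expanded_urls(users, csv_path):
    """Write one CSV row per (id_str, expanded_url) pair."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id_str", "expanded_url"])  # hypothetical column names
        for user in users:
            for url in extract_expanded_urls(user):
                writer.writerow([user.get("id_str", ""), url])


# Trimmed sample mirroring the structure shown in the question.
sample_users = [
    {
        "id_str": "231102009",
        "entities": {
            "description": {"urls": []},
            "url": {
                "urls": [
                    {
                        "expanded_url": "https://fanecabrava.substack.com/?utm_source=account-card&utm_content=writes"
                    }
                ]
            },
        },
    },
    # Hypothetical user with no profile URL: the "url" key is absent from entities.
    {"id_str": "0", "entities": {"description": {"urls": []}}},
]

print(extract_expanded_urls(sample_users[0]))
```

For the real file, something like users = json.load(open('json.json', encoding='utf-8')) followed by write_expanded_urls(users, 'expanded_urls.csv') should work, provided the top level of the file really is a list. Users without a profile URL are skipped rather than written as empty rows.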