Parse a request response for key/value pairs

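I am storing the response to a POST request against the Instagram API in a text file. The response body is HTML, which contains the access token I want to extract. It is HTML because this POST response is normally handled by the end user, who clicks a button and is then given an access code. I need to do this on the back end, though, so I have to process the HTML response myself.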

Anyway, here is my code so far (the real client ID is obviously hidden in this post):

import requests

OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
OAuth_AccessRequest = requests.post(OAuthURL).text
#print OAuth_AccessRequest

# Save the HTML response to a text file (.text is unicode, so encode it first).
with open('response.txt', 'w') as OAuthResponse:
    OAuthResponse.write(OAuth_AccessRequest.encode("UTF-8"))

# Read the saved HTML back in; the with block closes the file afterwards.
with open('response.txt', 'r') as OAuthReady:
    OAuthView = OAuthReady.read()
print OAuthView

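What I am left with is the HTML stored in a text file. However, the HTML contains dictionaries whose key/value pairs I need to access. For example, part of it looks like this: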

</div> <!-- .root -->

    <script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-shim.min.js></script>
<script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-sham.min.js></script>
<script type="text/javascript">window._sharedData = {"static_root":"\/\/instagramstatic-a.akamaihd.net\/bluebar\/422f3d9","entry_data":{},"hostname":"instagram.com","platform":{"is_touch":false,"app_platform":"web"},"qe":{"su":false},"display_properties_server_guess":{"viewport_width":360,"pixel_ratio":1.5},"country_code":"US","language_code":"en","gatekeepers":{"tr":false},"config":{"dismiss_app_install_banner_until":null,"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"},"environment_switcher_visible_server_guess":true};</script>

    </body>
</html>

What I need to grab is the string of digits that is the value of the "csrf_token" key. What is the best way to extract it from the HTML stored in the txt file?

If the csrf_token string is the only such string on the whole page, extracting it with a regular expression is straightforward:

import re

token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')

token = token_pattern.search(requests.post(OAuthURL).content).group(1)

Note that I used the response's content attribute; decoding the entire response to Unicode is pointless when all you need is a few ASCII characters.

Demo:

>>> import requests, re
>>> token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')
>>> OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
>>> token_pattern.search(requests.post(OAuthURL).content).group(1)
'3fd6022ac344c3eaea46e87e258ef9c6'
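
The session above is Python 2, where content is a plain str. On Python 3, content is bytes, so the pattern has to be a bytes pattern as well. The following is a minimal sketch of the same regex approach under that assumption; sample_body is a shortened copy of the HTML from the question standing in for requests.post(OAuthURL).content, and extract_csrf_token is a hypothetical helper name, not part of requests.

```python
import re

# Stand-in for requests.post(OAuthURL).content, which is bytes on Python 3.
# The body is a shortened copy of the HTML shown in the question.
sample_body = (
    b'<script type="text/javascript">window._sharedData = '
    b'{"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'
)

# Bytes pattern, so it can search the undecoded response body directly.
token_pattern = re.compile(rb'"csrf_token":\s*"([^"]+)"')


def extract_csrf_token(body):
    """Return the csrf_token value as str, or None if it is not present."""
    match = token_pattern.search(body)
    if match is None:
        return None
    # Only the short matched value is decoded, not the whole page.
    return match.group(1).decode("ascii")


token = extract_csrf_token(sample_body)
print(token)
```

Checking for None before calling group(1) avoids an AttributeError when the page layout changes and the token is missing.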

You may also want to look at the response's headers and cookies; the CSRF token is often set as a cookie as well (or at least stored as a value in the session).

For example, for this particular request the token is also stored as a cookie, matching the value in the JavaScript block:

>>> r = requests.post(OAuthURL)
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='csrftoken', value='b2b621c198642e26a19fc9bf1b38d246', port=None, port_specified=False, domain='instagram.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1467828030, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
>>> r.cookies['csrftoken']
'b2b621c198642e26a19fc9bf1b38d246'
>>> 'b2b621c198642e26a19fc9bf1b38d246' in r.content
True
>>> token_pattern.search(r.content).group(1)
'b2b621c198642e26a19fc9bf1b38d246'
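
If more than one key/value pair is needed from the page, one possible alternative to a per-key regex is to pull out the whole window._sharedData object and parse it as JSON. This is only a sketch: it assumes the script block keeps the shape shown in the question (the object is followed directly by `;</script>`), the html string below is a shortened copy of that sample rather than a live response, and parse_shared_data is a hypothetical helper name.

```python
import json
import re

# Shortened copy of the script block from the question (raw string keeps
# the JSON "\/" escapes intact).
html = r'''<script type="text/javascript">window._sharedData = {"static_root":"\/\/instagramstatic-a.akamaihd.net\/bluebar\/422f3d9","hostname":"instagram.com","country_code":"US","config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'''

# Capture everything between "window._sharedData =" and the ";</script>"
# that ends the assignment. Assumes that sequence does not occur earlier
# inside the JSON itself.
shared_data_pattern = re.compile(
    r'window\._sharedData\s*=\s*(\{.*?\});\s*</script>', re.DOTALL
)


def parse_shared_data(page):
    """Return the _sharedData object as a dict, or None if it is not found."""
    match = shared_data_pattern.search(page)
    if match is None:
        return None
    return json.loads(match.group(1))


shared_data = parse_shared_data(html)
print(shared_data["config"]["csrf_token"])
print(shared_data["country_code"])
```

The result is an ordinary dict, so nested values such as shared_data["config"]["csrf_token"] can be read without writing a separate pattern for each key.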