Parse request response for key, value pairs

Question

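I am storing the response of a POST request to the Instagram API in a text file. The body of this response is HTML, and it contains an access token that I want to extract. It is HTML because this POST response is normally handled by the end user, who clicks a button and then receives an access code. I need to do this on the back end, however, so I have to process the HTML response myself.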

Anyway, here is my code so far (the real client ID is obviously hidden in this post):

import requests

OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
OAuth_AccessRequest = requests.post(OAuthURL).text
#print OAuth_AccessRequest

# Save the HTML response to disk
with open('response.txt', 'w') as OAuthResponse:
    OAuthResponse.write(OAuth_AccessRequest.encode("UTF-8"))

# Read it back; the with block closes the file when done
with open('response.txt', 'r') as OAuthReady:
    OAuthView = OAuthReady.read()
print OAuthView

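What I am left with is the HTML stored in a text file. Within the HTML, however, there are dictionaries whose key, value pairs I need to access. For example, part of it looks like this: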

</div> <!-- .root -->

    <script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-shim.min.js></script>
<script src=//instagramstatic-a.akamaihd.net/bluebar/422f3d9/scripts/polyfills/es5-sham.min.js></script>
<script type="text/javascript">window._sharedData = {"static_root":"\/\/instagramstatic-a.akamaihd.net\/bluebar\/422f3d9","entry_data":{},"hostname":"instagram.com","platform":{"is_touch":false,"app_platform":"web"},"qe":{"su":false},"display_properties_server_guess":{"viewport_width":360,"pixel_ratio":1.5},"country_code":"US","language_code":"en","gatekeepers":{"tr":false},"config":{"dismiss_app_install_banner_until":null,"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"},"environment_switcher_visible_server_guess":true};</script>

    </body>
</html>

What I need to grab is the string of characters stored as the value of the "csrf_token" key. What is the best way to extract it from the HTML stored in the txt file?

Answer 1

If the csrf_token string is the only such string in the whole page, extracting it with a regular expression is simple:

import re

token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')

token = token_pattern.search(requests.post(OAuthURL).content).group(1)

Note that I used the content attribute of the response; there is no point in decoding the whole response to Unicode when all you need is a few ASCII characters.

Demo:

>>> import requests, re
>>> token_pattern = re.compile(r'"csrf_token":\s*"([^"]+)"')
>>> OAuthURL = "https://api.instagram.com/oauth/authorize/?client_id=cb0096f08a3848e65f&redirect_uri=https://www.smashboarddashboard.com/whathappened&response_type=code"
>>> token_pattern.search(requests.post(OAuthURL).content).group(1)
'3fd6022ac344c3eaea46e87e258ef9c6'
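
The snippets above are Python 2, where content is a plain str. On Python 3 content is a bytes object, and searching it with a text pattern raises a TypeError, so the pattern has to be a bytes pattern as well. A minimal sketch under that assumption follows; the helper name extract_csrf_token and the sample body are made up for illustration, and in real use the argument would be requests.post(OAuthURL).content:

```python
import re

# Bytes pattern (note the b prefix) so it can search undecoded response bytes.
TOKEN_PATTERN = re.compile(rb'"csrf_token":\s*"([^"]+)"')


def extract_csrf_token(body):
    """Return the csrf_token value found in the raw body, or None if absent."""
    match = TOKEN_PATTERN.search(body)
    if match is None:
        return None
    # The token is plain ASCII, so only the captured group needs decoding.
    return match.group(1).decode("ascii")


# Hypothetical stand-in for requests.post(OAuthURL).content,
# trimmed from the page shown in the question.
sample_body = (
    b'<script type="text/javascript">window._sharedData = '
    b'{"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'
)
print(extract_csrf_token(sample_body))
```

Returning None instead of calling .group(1) directly avoids an AttributeError when the page does not contain the token at all.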

You may also want to look at the headers and cookies of the response; a CSRF token is often set as a cookie as well (or at least stored as a value in the session).

For example, for this specific request the token is also stored as a cookie, and it matches the value in the JavaScript block:

>>> r = requests.post(OAuthURL)
>>> r.cookies
<RequestsCookieJar[Cookie(version=0, name='csrftoken', value='b2b621c198642e26a19fc9bf1b38d246', port=None, port_specified=False, domain='instagram.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1467828030, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
>>> r.cookies['csrftoken']
'b2b621c198642e26a19fc9bf1b38d246'
>>> 'b2b621c198642e26a19fc9bf1b38d246' in r.content
True
>>> token_pattern.search(r.content).group(1)
'b2b621c198642e26a19fc9bf1b38d246'
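
If more than one value from the script block is needed, a different approach from the single-key regex is to cut the whole window._sharedData object out of the page and decode it with the json module, which gives an ordinary dict of key, value pairs. The sketch below assumes the object is valid JSON and is followed directly by ";</script>", as in the excerpt from the question; the helper name parse_shared_data and the shortened sample page are hypothetical:

```python
import json
import re

# Captures the object literal between "window._sharedData =" and ";</script>".
SHARED_DATA_PATTERN = re.compile(
    r'window\._sharedData\s*=\s*(\{.*?\})\s*;\s*</script>', re.DOTALL
)


def parse_shared_data(html):
    """Return the window._sharedData object as a dict, or None if not found."""
    match = SHARED_DATA_PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical stand-in for the decoded page text, shortened from the question.
sample_html = (
    '<script type="text/javascript">window._sharedData = '
    '{"hostname":"instagram.com",'
    '"platform":{"is_touch":false,"app_platform":"web"},'
    '"config":{"viewer":null,"csrf_token":"2aedabf96ad1fe86fab0"}};</script>'
)

data = parse_shared_data(sample_html)
print(data["config"]["csrf_token"])
print(data["platform"]["app_platform"])
```

This version works on decoded text, so it would be fed the text attribute of the response rather than content. It costs more than the single-key regex, but nested values such as data["platform"]["is_touch"] come back as native Python types.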

python

python-2.7

python-requests