Python Data Scrape - Form Authentication Issue

Below is some code I've been trying to use to log in to the Cook's Illustrated website (https://www.cooksillustrated.com/sign_in).

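I start a session, get the authenticity token and a hidden encoding field, and then pass the "name" and "value" of the email and password fields (found by inspecting the elements on the login page). The form doesn't appear to contain any other elements; however, the post method doesn't log me in.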

I noticed that all the CSRF tokens end in "==", so I tried stripping those characters off. That didn't work.

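I also tried modifying the post to use the "id" attribute of the form inputs instead of "name" (just a shot in the dark, really; "name" seems like it should work, judging by other examples I've seen).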

Any ideas would be much appreciated.

import requests, lxml.html
s = requests.session()

# go to the login page and get its text
login = s.get('https://www.cooksillustrated.com/sign_in')
login_html = lxml.html.fromstring(login.text)

# find the hidden fields names and values; store in a dictionary
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib['name']: x.attrib['value'] for x in hidden_inputs}
print(form)

# I noticed that they all ended in two = signs, so I tried taking that off
# form['authenticity_token'] = form['authenticity_token'][:-2]

# this adds to the form payload the two named fields for user name and password
# found using the "inspect elements" on the login screen
form['user[email]'] = 'my_email'
form['user[password]'] = 'my_pw'

# this uses "id" instead of "name" from the input fields
#form['user_email'] = 'my_email'
#form['user_password'] = 'my_pw'

response = s.post('https://www.cooksillustrated.com/sign_in', data=form)
print(form)

# trying to see if it worked - but the response URL is login again instead of main page
# and it can't find my name
# responses are okay, but I think that just means it posted the form
print(response.url)
print('Christopher' in response.text)
print(response.status_code)
print(response.ok)

Well, the POST request URL should be https://www.cooksillustrated.com/sessions. If you capture all the traffic while logging in, you'll see the actual POST request that is sent to the server:

POST /sessions HTTP/1.1
Host: www.cooksillustrated.com
Connection: keep-alive
Content-Length: 179
Cache-Control: max-age=0
Origin: https://www.cooksillustrated.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://www.cooksillustrated.com/sign_in
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.8

utf8=%E2%9C%93&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo

Note that the last line is the URL-encoded data for this request. It has 4 parameters: utf8, authenticity_token, user[email], and user[password].
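To see those four parameters more clearly, the captured body can be decoded with the standard library. This is only a sketch for inspecting the request shown above; the variable names are my own, and the body string is copied verbatim from the capture.

```python
from urllib.parse import parse_qsl, urlencode

# Request body copied from the captured POST /sessions request above
body = 'utf8=%E2%9C%93&authenticity_token=Uvku64N8V2dq8z%2BGerrqWNobn03Ydjvz8xqgOAvfBmvDM%2B71xJWl2DmRU4zbBE15gGVESmDKP2E16KIqBeAJ0g%3D%3D&user%5Bemail%5D=demo&user%5Bpassword%5D=demodemo'

# Decode the percent-encoded pairs into a dict (insertion order is preserved)
params = dict(parse_qsl(body))

# The four field names the server receives
print(list(params))

# The decoded token still carries its trailing '==' padding
print(params['authenticity_token'].endswith('=='))

# Re-encoding the decoded fields gives back the captured body,
# which is what posting them with data=... should send
print(urlencode(params) == body)
```

The '==' at the end of the token is part of its value, which is why stripping it changes what the server receives.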

So in your case, form should include all of these:

form = {'user[email]': 'my_email',
        'user[password]': 'my_pw',
        'utf8': '✓',  # the field is named utf8, as in the captured request
        'authenticity_token': 'xxxxxx'  # send the token as-is, including the trailing '=='
}

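In addition, you may want to add some headers so the request appears to come from Chrome (or whichever browser you prefer), because the default User-Agent header of requests is python-requests/2.13.0, and some websites don't like traffic from "bots":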

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', 
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 
           'Accept-Encoding': 'gzip, deflate, br', 
           # ... more headers as needed
}

Now we're ready to make the POST request:

response = s.post('https://www.cooksillustrated.com/sessions', data=form, headers=headers)
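Putting the pieces together, here is a minimal sketch of the whole flow, assuming the hidden fields on the sign-in page are the utf8 and authenticity_token inputs seen in the captured request. The names HiddenInputParser and log_in are hypothetical helpers, not part of requests, and the parser uses the standard library's html.parser instead of lxml. As a simplification it collects every hidden input on the page rather than only those inside a form.

```python
from html.parser import HTMLParser

SIGN_IN_URL = 'https://www.cooksillustrated.com/sign_in'
SESSIONS_URL = 'https://www.cooksillustrated.com/sessions'


class HiddenInputParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        if attrs.get('type') == 'hidden' and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''


def log_in(session, email, password, headers=None):
    """Fetch the sign-in page, reuse its hidden fields, and POST to /sessions.

    `session` is expected to behave like requests.Session (get/post methods).
    """
    page = session.get(SIGN_IN_URL, headers=headers)
    parser = HiddenInputParser()
    parser.feed(page.text)

    form = dict(parser.fields)        # hidden fields, values left unmodified
    form['user[email]'] = email
    form['user[password]'] = password
    return session.post(SESSIONS_URL, data=form, headers=headers)
```

With requests installed, this would be called as `response = log_in(requests.Session(), 'my_email', 'my_pw', headers=headers)`. Checking `response.url` afterwards, as in the question, is still a reasonable way to tell whether the login went through.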