Site returns login page again when scraping after logging in successfully once using MechanicalSoup?
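I am trying to scrape some data from Twitter with BeautifulSoup as part of a project. To scrape the "following" section I need to log in first, so I tried to do that with MechanicalSoup. I know the login succeeds because I receive an email, but when I go to a different page on the same site to scrape the data, it redirects me to the login page again.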
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser(
    soup_config={'features': 'lxml'},
    raise_on_404=True,
    user_agent='MyBot/0.1: mysite.example.com/bot_info',
)
login_page = browser.get("https://twitter.com/login")
login_form = login_page.soup.findAll("form")
login_form = login_form[2]
login_form.find("input", {"name": "session[username_or_email]"})["value"] = "puturusername"
login_form.find("input", {"name": "session[password]"})["value"] = "puturpassword"
login_response = browser.submit(login_form, login_page.url)
login_response.soup()
This sent me a successful-login email. I then tried:
from bs4 import BeautifulSoup as soup

page_html = browser.open('https://twitter.com/MKBHD/following').text
page_soup = soup(page_html, "html.parser")
page_soup
The page I get back contains https://twitter.com/login?redirect_after_login=%2FMKBHD%2Ffollowing&
instead of the actual "following" page.
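One way to notice this situation in a script is to inspect the final URL of the response before parsing it. The sketch below uses only the standard library; the helper name login_redirect_target is hypothetical, and it assumes the site keeps using the /login path and the redirect_after_login query parameter seen in the URL above.

```python
from urllib.parse import urlparse, parse_qs


def login_redirect_target(url):
    """Return the path the site wanted to reach after login.

    Returns None when the URL is not the login page, and an empty
    string when it is the login page without a redirect target.
    """
    parsed = urlparse(url)
    if parsed.path.rstrip("/") != "/login":
        return None
    params = parse_qs(parsed.query)
    # parse_qs decodes percent-escapes, so %2F comes back as "/"
    return params.get("redirect_after_login", [""])[0]


bounced = "https://twitter.com/login?redirect_after_login=%2FMKBHD%2Ffollowing&"
print(login_redirect_target(bounced))
```

In a scraper, the value passed in would be the url attribute of the response returned by browser.open(...); a non-None result means the session was not accepted for that page.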
If I try the code below instead of 'browser.open('https://twitter.com/MKBHD/following').text':
# verify we are now logged in
page = browser.get_current_page()
print(page)
messages = page.find("div", class_="flash-messages")
if messages:
    print(messages.text)
assert page.select(".logout-form")
print(page.title.text)
# verify we remain logged in (thanks to cookies) as we browse the rest of
# the site
page3 = browser.open("https://github.com/MechanicalSoup/MechanicalSoup")
assert page3.soup.select(".logout-form")
I get the output:
----> 4 messages = page.find("div", class_="flash-messages")
AttributeError: 'NoneType' object has no attribute 'find'
Update:
login_response.soup()
gives me the following:
</style>, <body>
<noscript>
<center>If you’re not redirected soon, please <a href="/">use this link</a>.</center>
</noscript>
<script nonce="O1gf092z/sXmKkH64mLOzQ==">
document.cookie = "app_shell_visited=1;path=/;max-age=5";
location.replace(location.href.split("#")[0]);
</script>
</body>, <noscript>
<center>If you’re not redirected soon, please <a href="/">use this link</a>.</center>
</noscript>, <center>If you’re not redirected soon, please <a href="/">use this link</a>.</center>, <a href="/">use this link</a>, <script nonce="O1gf092z/sXmKkH64mLOzQ==">
document.cookie = "app_shell_visited=1;path=/;max-age=5";
location.replace(location.href.split("#")[0]);
</script>]
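The output above is an interstitial page that relies on JavaScript (location.replace) to continue, which an HTTP-only client does not execute. As a rough sketch, a page like this can be detected before any attempt to parse it as real content. The class and function names here are hypothetical, only the standard-library html.parser is used, and the marker strings are an assumption based on the script shown above.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def looks_like_js_redirect(html):
    """Heuristic: True if any inline script navigates via location."""
    collector = ScriptCollector()
    collector.feed(html)
    markers = ("location.replace(", "location.assign(")
    return any(m in script for script in collector.scripts for m in markers)


sample = """<body>
<noscript><center>please <a href="/">use this link</a>.</center></noscript>
<script nonce="x">
document.cookie = "app_shell_visited=1;path=/;max-age=5";
location.replace(location.href.split("#")[0]);
</script>
</body>"""
print(looks_like_js_redirect(sample))
```

This is only a heuristic: it tells the script that the response is a JavaScript-driven hop, not how to follow it.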
To avoid getting the redirect page, you can use a StatefulBrowser() object instead of Browser().
I wrote a short post about it: https://piratefache.ch/python-3-mechanize-and-beautifulsoup
import mechanicalsoup

if __name__ == "__main__":
    URL = "https://twitter.com/login"
    LOGIN = "your_login"
    PASSWORD = "your_password"
    TWITTER_NAME = "displayed_name"  # Displayed username on Twitter

    # Create a browser object
    browser = mechanicalsoup.StatefulBrowser()
    # Request the Twitter login page
    browser.open(URL)
    # Grab the login form
    browser.select_form('form[action="https://twitter.com/sessions"]')
    # Print form inputs
    browser.get_current_form().print_summary()
    # Specify username and password
    browser["session[username_or_email]"] = LOGIN
    browser["session[password]"] = PASSWORD
    # Submit the form
    response = browser.submit_selected()
    # Get the current page output
    response_after_login = browser.get_current_page()
    # Verify we are now logged in (get the img element whose alt contains the username).
    # If you find a better way to check, let me know. Since Twitter generates all
    # its classes dynamically, it's pretty complicated to get better information.
    # The alt value is quoted so that names containing spaces form a valid selector.
    user_element = response_after_login.select_one('img[alt="' + TWITTER_NAME + '"]')
    # If the username is in the img field, the user is successfully connected
    if TWITTER_NAME in str(user_element):
        print("You're connected as " + TWITTER_NAME)
    else:
        print("Not connected")