How do I stop the 302 URL redirection in a simple web crawl?
I am trying to scrape a website using the Requests library in Python. When I try:
>>> import requests
>>> r = requests.get('http://www.cell.com/cell-stem-cell/home', allow_redirects=False)
>>> r.status_code
302
>>> r.text
'The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site">here</a>\n'
And when I request the URL it points to:
>>> r = requests.get("https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site")
>>>
>>> r.text
'\n\n\n\n\n<style type="text/css">\n .hidden {\n display: none;\n visibility: hidden;\n }\n</style>\n\n<!-- hidden iFrame for each of the SSO URLs -->\n<div class="hidden">\n \n <iframe src="//acw.secure.jbs.elsevierhealth.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n <iframe src="//acw.sciencedirect.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n <iframe src="//acw.scopus.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n <iframe src="//acw.sciverse.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n <iframe src="//acw.mendeley.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n <iframe src="//acw.elsevier.com/SSOCore/update?utt=81c120bb854495181ef4ef3f679b12261e956c5-JKh">Your browser doesn\'t support iFrames!</iframe>\n \n</div>\n\n\n\n<noscript>\n <a href="CANT POST LINK BECAUSE OF LACK OF REPUTATION POINTS OF STACK OVERFLOW">Redirect</a>\n</noscript>\n\n<!-- redirect to the product page after all iFrames are rendered -->\n<script>\n setTimeout(redirectFun,2000);\n var iFramesList = document.getElementsByTagName("iframe");\n var renderedIFramesCount = 0;\n var numberOfIFrames = iFramesList.length;\n for (var i = 0; i < iFramesList.length; i++) {\n var iFrame = iFramesList[i];\n bindEvent(iFrame, \'load\', function(){\n renderedIFramesCount = renderedIFramesCount + 1;\n if (renderedIFramesCount >= numberOfIFrames)\n {\n redirectFun();\n }\n });\n }\n var doRedirect = true;\n function redirectFun() {\n if (doRedirect)\n window.location.href = "CANT POST THIS WEBSITE BECAUSE OF MY REPUTATION POINTS ON STACK OVERFLOW";\n doRedirect = false;\n }\n\n function bindEvent(el, eventName, eventHandler) {\n if (el.addEventListener){\n el.addEventListener(eventName, eventHandler, false);\n } else if (el.attachEvent){\n el.attachEvent(eventName, eventHandler);\n }\n }\n</script>\n\n'
I just want to get the HTML of the original website.
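You have to send a User-Agent header with the request so that the site believes the request comes from a real web browser. So if you want the content of the non-redirected URL, your code should be:
from requests import get
# Browser-like User-Agent; split across two string literals only to keep the line short.
ua = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
# allow_redirects=False returns the 302 response itself instead of following it.
content = get('http://www.cell.com/cell-stem-cell/home', headers={'User-Agent': ua}, allow_redirects=False).content
print(content)
The output will be: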
The URL has moved <a href="https://secure.jbs.elsevierhealth.com/action/getSharedSiteSession?redirect=http%3A%2F%2Fwww.cell.com%2Fcell-stem-cell%2Fhome&rc=0&code=cell-site">here</a>
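If you want the content of the redirected URL instead, allow redirects but still include the User-Agent header. This approach works for most sites that do not rely on dynamic content. To scrape data from a site whose content is generated dynamically, you have to use a browser automation tool such as Selenium.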
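A rough way to tell which case a response falls into is to look for a client-side redirect in the HTML, like the `window.location.href = ...` script in the page shown in the question. The sketch below is only a heuristic; the function name and the regular expressions are assumptions made for illustration, not part of Requests or Selenium, and real pages can redirect in ways these patterns miss.

```python
import re

# Heuristic patterns (assumptions for this sketch, not an exhaustive list).
# Matches assignments such as `window.location.href = "..."` but not
# comparisons such as `location == x`, plus location.replace()/assign().
_JS_REDIRECT = re.compile(
    r'\b(?:window\.|document\.)?location(?:\.href)?\s*=[^=]'
    r'|\blocation\.(?:replace|assign)\s*\(',
    re.IGNORECASE,
)
# Matches <meta http-equiv="refresh" ...> tags.
_META_REFRESH = re.compile(r'<meta[^>]+http-equiv=["\']?refresh', re.IGNORECASE)


def looks_like_client_side_redirect(html):
    """Return True if the HTML appears to redirect via JavaScript or meta refresh."""
    return bool(_JS_REDIRECT.search(html) or _META_REFRESH.search(html))
```

If this returns True for `r.text`, the page a browser would end up on is probably produced by script, and a plain HTTP client will only ever see the intermediate page.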
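You can get it directly with very little work. When a redirect is required, the server sends a Location header. You just need to request the URL in that header:
import requests
# Requests follows redirects by default, so disable that to see the 302 itself.
r = requests.get('http://www.cell.com/cell-stem-cell/home', allow_redirects=False)
if r.status_code == 302:
    r1 = requests.get(r.headers['Location'])
You will find the data you need in r1.content or r1.text.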
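The same idea can be extended to a chain of redirects by following each Location header by hand. This is a minimal sketch under stated assumptions: `follow_redirects`, `REDIRECT_CODES` and `max_hops` are names invented here, and `session` stands for any object with a Requests-style `get(url, allow_redirects=False)` method, such as a `requests.Session()`.

```python
from urllib.parse import urljoin

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_redirects(session, url, max_hops=5):
    """Follow Location headers manually.

    Returns (final_response, visited_urls). Raises RuntimeError if the
    chain is still redirecting after max_hops requests.
    """
    visited = [url]
    for _ in range(max_hops):
        response = session.get(url, allow_redirects=False)
        location = response.headers.get('Location')
        if response.status_code not in REDIRECT_CODES or not location:
            return response, visited
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        visited.append(url)
    raise RuntimeError('gave up after %d redirects' % max_hops)
```

With Requests this could be called as `follow_redirects(requests.Session(), 'http://www.cell.com/cell-stem-cell/home')`. A session keeps cookies between hops, which may matter for a sign-on redirect like the one in the question, and the hop limit guards against a loop back to the starting URL.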