urllib3 cannot open an article that urllib2 could open, due to private mode detection
How can I bypass private mode detection with urllib3? I have the following code, which does not work:
import urllib3
from bs4 import BeautifulSoup
articleURL = "https://www.washingtonpost.com/news/the-switch/wp/2016/10/18/the-pentagons-massive-new-telescope-is-designed-to-track-space-junk-and-watch-out-for-killer-asteroids/"
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
http = urllib3.PoolManager()
response = http.request('GET', articleURL)
soup = BeautifulSoup(response.data.decode('utf-8', 'ignore'))
soup
This produces the following error:
</script> <script>var _0x108f=["blockers","pb-adblock-checked","resolve","all","overlay","mobile","desktop","browsers","max","isAnon","isSubscriber","Features","displayOverlay","extListener","getTime","performance","timing","navigationStart","registerPwapiConsumer","getOwnPropertyDescriptor","get","reject","notdetected","standard","notblocked","stack","validate","addEventListener","pb-core-loaded","iterator","symbol","function","constructor","prototype","assign","apply","Keep supporting great journalism by turning off your ad blocker. Or purchase a subscription for unlimited access to real news you can count on.",
'\x3ca data-link-ff\x3d"https://www.washingtonpost.com/steps-for-disabling-firefoxs-native-adblocker/2018/05/21/fb95bf4e-5d37-11e8-b2b8-08a538d9dbd6_story.html" data-link\x3d"https://www.washingtonpost.com/steps-for-disabling-adblocker/2016/09/14/a8c3d4d2-7aac-11e6-bd86-b7bbd53d2b5d_story.html" href\x3d"https://www.washingtonpost.com/steps-for-disabling-adblocker/2016/09/14/a8c3d4d2-7aac-11e6-bd86-b7bbd53d2b5d_story.html"\x3eUnblock ads\x3c/a\x3e','\x3ca href\x3d"https://subscribe.washingtonpost.com/acq/?promo\x3do12" target\x3d"_blank"\x3e\x3cspan class\x3d"subscribe-link"\x3eTry 1 month for \x3c/span\x3e\x3c/a\x3e',
"event 86","We noticed you\u2019re browsing in private mode.","Private browsing is permitted exclusively for our subscribers. Turn off private browsing to keep reading this story, or subscribe to use this feature, plus get unlimited digital access.",'\x3ca data-link-ff\x3d"https://helpcenter.washingtonpost.com/hc/en-us/articles/360028029392l" data-link\x3d"https://helpcenter.washingtonpost.com/hc/en-us/articles/360028029392" href\x3d"https://helpcenter.washingtonpost.com/hc/en-us/articles/360028029392"\x3eTurn off private browsing\x3c/a\x3e'
I did not mean to trigger this warning, and the same request works fine with urllib2:
import urllib2
from bs4 import BeautifulSoup
articleURL = "https://www.washingtonpost.com/news/the-switch/wp/2016/10/18/the-pentagons-massive-new-telescope-is-designed-to-track-space-junk-and-watch-out-for-killer-asteroids/"
page = urllib2.urlopen(articleURL).read().decode('utf8','ignore')
soup = BeautifulSoup(page,"lxml")
soup
Try this change (you need to specify a user-agent header):
headers = {'user-agent': 'Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0'}
response = http.request('GET', articleURL, headers=headers)
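Putting the change together, here is a minimal sketch of how the header could be set once on the pool so that every request carries it. The helper names `make_pool` and `fetch_html` are hypothetical, and the User-Agent value is simply the one from the snippet above. Whether the site accepts it is an assumption that may stop holding if the site changes its checks.

```python
import urllib3

# Hypothetical constant; the value is copied from the snippet above.
USER_AGENT = "Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0"


def make_pool():
    """Return a PoolManager that sends the User-Agent header on every request."""
    return urllib3.PoolManager(headers={"user-agent": USER_AGENT})


def fetch_html(http, url):
    """GET url and return the body decoded as UTF-8, ignoring undecodable bytes."""
    response = http.request("GET", url)
    if response.status != 200:
        # Fail loudly rather than parsing an error page as if it were the article.
        raise RuntimeError("unexpected HTTP status: %s" % response.status)
    return response.data.decode("utf-8", "ignore")
```

With these in place, the original parsing step would become something like `soup = BeautifulSoup(fetch_html(make_pool(), articleURL), "lxml")`. Setting the header on the `PoolManager` rather than per request is only a convenience. Passing `headers=headers` to each `http.request` call, as shown above, has the same effect.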