Unable to read html page from beautiful soup
The following code gets stuck after printing 'hi' in the output. Can you check what is wrong with it? Is the website secured so that I need some special authentication?
from bs4 import BeautifulSoup
import requests

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)
I ran into the same problem as you: the request just sat there. I tried adding a user agent and it pulled the page down quickly. Not sure why that makes the difference.
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

print('hi')
rooturl = 'http://www.hoovers.com/company-information/company-search.html'
r = requests.get(rooturl, headers=headers)
print('hi1')
soup = BeautifulSoup(r.content, "html.parser")
print('hi2')
print(soup)
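If the request does hang like this, passing a timeout to requests.get makes it fail fast instead of blocking forever; a minimal sketch, with an arbitrary 10-second value:

import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
try:
    # timeout covers both connecting and reading; the value is an arbitrary choice
    r = requests.get(rooturl, headers=headers, timeout=10)
except requests.Timeout:
    print('request timed out - the server is probably stalling the connection')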
Edit: This is so weird. Now it isn't working for me anymore. It didn't work at first, then it did, and now it doesn't again. But there is another possible option: using Selenium.
from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.hoovers.com/company-information/company-search.html')
r = browser.page_source
print('hi1')
soup = BeautifulSoup(r, "html.parser")
print('hi2')
print(soup)
browser.close()
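If you would rather not have a browser window pop up, Chrome can also run headless; a rough sketch, assuming a chromedriver that matches your installed Chrome:

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window

browser = webdriver.Chrome(options=options)
try:
    browser.get('http://www.hoovers.com/company-information/company-search.html')
    soup = BeautifulSoup(browser.page_source, "html.parser")
    print(soup.title)
finally:
    browser.quit()  # quit() shuts down the driver process as well, unlike close()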
The reason you ran into this is that the website thinks you are a robot, so it won't send you anything, and on top of that it leaves the connection hanging so you wait forever. If you just imitate the browser's request, the server will no longer consider you a robot.

Adding headers is the simplest way to deal with this, but you shouldn't pass only the User-Agent (as was done here). Remember to copy your browser's request and strip out the useless headers by testing (a rough pruning sketch follows the code below). If you're too lazy to do that, just use your browser's headers wholesale, but don't copy all of them when you are uploading files.
from bs4 import BeautifulSoup
import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
with requests.Session() as se:
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }
    resp = se.get(rooturl)
    print(resp.content)
    soup = BeautifulSoup(resp.content, "html.parser")
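To find out which headers actually matter, one rough way is to drop them one at a time and see whether the page still comes back; a sketch, where the status-code and content-length check is only a heuristic assumption, not a guarantee:

import requests

rooturl = 'http://www.hoovers.com/company-information/company-search.html'
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en",
}

for name in list(browser_headers):
    # drop one header and retry the request
    trimmed = {k: v for k, v in browser_headers.items() if k != name}
    try:
        resp = requests.get(rooturl, headers=trimmed, timeout=10)
        ok = resp.status_code == 200 and len(resp.content) > 1000
    except requests.RequestException:
        ok = False
    print(f"without {name}: {'still works' if ok else 'blocked or hung'}")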