Why don't BeautifulSoup and lxml work?

I'm using the mechanize library to log in to a website. I checked, and the login works fine. The problem is that I can't use response.read() with BeautifulSoup or lxml.
# BeautifulSoup
import mechanize
from bs4 import BeautifulSoup

browser = mechanize.Browser()   # already logged in at this point
response = browser.open(url)
source = response.read()
soup = BeautifulSoup(source)    # source.txt doesn't work either
some_list = set()               # a set, so .add() below is valid
for link in soup.findAll('a', {'class': 'someClass'}):
    some_list.add(link)

This doesn't work; it doesn't actually find any tags. It works fine when I use requests.get(url).
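One quick way to see what is happening is to check whether the class name occurs anywhere in the raw HTML that mechanize received — a minimal sketch, reusing the source string from above:

# If this prints False, the class is simply not in the HTML the parser sees
print('someClass' in source)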
# lxml -> html
from lxml import html

response = browser.open(url)
source = response.read()
tree = html.fromstring(source)  # source.txt doesn't work either
print(tree.text)
like_pages = buyers = tree.xpath('//a[@class="UFINoWrap"]')  # adding /text() doesn't work either
print(like_pages)

This doesn't print anything. I know something is wrong with the type of the response, because it works with requests.get(). What can I do? Could you give example code that uses response.read() for HTML parsing?
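For reference, the same XPath does find the link when the class is actually present in the markup handed to lxml — a minimal sketch with an inline HTML string (the snippet and link text are made up), which suggests the parser itself is not the problem:

from lxml import html

snippet = '<div><a class="UFINoWrap" href="/x">Likes</a></div>'
tree = html.fromstring(snippet)
print(tree.xpath('//a[@class="UFINoWrap"]/text()'))  # prints ['Likes']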

By the way, what is the difference between a response object and a requests object?

Thanks!

I found the solution. The reason is that mechanize.browser is an emulated browser, so it only gets the raw HTML. The page I wanted to scrape adds the class to its tags with the help of JavaScript, so those classes were not in the raw HTML. The best option is to use a webdriver. I used Selenium for Python. Here is the code:

from selenium import webdriver

# Raise the limit on user:password length in URLs (useful for HTTP auth)
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)

driver.get(url)
# The classes exist now, because the browser has executed the page's JavaScript
links = driver.find_elements_by_xpath('//a[@class="someClass"]')

Note: you need to have Firefox installed. Alternatively, you can choose another profile/driver depending on the browser you want to use.
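If Firefox isn't available, here is a minimal sketch with Chrome instead (assuming chromedriver is installed and on your PATH); the implicit wait gives the page's JavaScript time to run before elements are looked up:

from selenium import webdriver

driver = webdriver.Chrome()   # requires chromedriver on PATH
driver.implicitly_wait(10)    # wait up to 10 seconds for elements to appear

driver.get(url)
links = driver.find_elements_by_xpath('//a[@class="someClass"]')
driver.quit()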


A request is what a web client sends to a server, with details about what URL the client wants and what HTTP verb to use (GET, POST, etc.); if you are submitting a form, the request typically contains the data you put in the form. A response is what a web server sends back in reply to a request from a client. The response has a status code which indicates whether the request was successful (usually 200 if there were no problems, or an error code like 404 or 500). The response usually contains data, like the HTML of a page or the binary data of a JPEG. The response also has headers that give more information about what data is in the response (e.g. the "Content-Type" header, which says what format the data is in).

Quoted from @davidbuxton's answer at this link.
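To make the distinction concrete, here is a minimal sketch with the requests library (the URL is only an illustration): one request goes out, and the returned Response object carries the status code, headers, and body described above:

import requests

response = requests.get('http://example.com')

print(response.status_code)              # e.g. 200 if all went well
print(response.headers['Content-Type'])  # e.g. 'text/html; charset=UTF-8'
print(response.text[:200])               # the start of the HTML body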

Good luck!