Beautiful Soup is not selecting any element

Question

This is the code I use to iterate over all the elements:

soup_top = bs4.BeautifulSoup(r_top.text, 'html.parser')

selector = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'

for link in soup_top.select(selector):
    print(link)

The same selector, when used in JavaScript, matches 57 elements:

document.querySelectorAll("#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a").length;

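I thought I might not be fetching the page content correctly, so I saved a local copy of the page, but the selector still does not select anything in Beautiful Soup. What is going on here?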

This is the website I am working with.

Answer 1

This appears to be caused by the parser you are using (html.parser). If I try the same thing with lxml as the parser:

from bs4 import BeautifulSoup
import requests

url = 'http://www.swapnilpatni.com/law_charts_final.php'
r = requests.get(url)
r.raise_for_status()

soup = BeautifulSoup(r.text, 'lxml')

css_select = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'
links = soup.select(css_select)
print('{} link(s) found'.format(len(links)))

# Output: 1 link(s) found

for link in links:
    print(link['href'])

# Output: spadmin/doc/Company Law amendment 1.1.png

With html.parser, the selector returns results only as far as #ContentPlaceHolder1_gvDisplay table tr, and even then it returns only the first tr.
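One way to see where a selector stops matching is to count the matches for each growing prefix of it, once per parser. The following is only a rough sketch: the helper name count_selector_prefixes, the sample markup, and the #grid id are made up for illustration, and splitting on whitespace assumes the selector uses only plain descendant combinators, as the one in the question does.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_selector_prefixes(html, selector, parser="html.parser"):
    """Return (prefix, match_count) pairs for each growing prefix of a selector.

    Assumption: the selector's steps are separated by whitespace only
    (descendant combinators), so splitting on whitespace is safe.
    """
    soup = BeautifulSoup(html, parser)
    steps = selector.split()
    results = []
    for i in range(1, len(steps) + 1):
        prefix = " ".join(steps[:i])
        results.append((prefix, len(soup.select(prefix))))
    return results


# Hypothetical, well-formed sample markup (not the page from the question).
SAMPLE_HTML = """
<div id="grid">
  <table>
    <tr><td>1</td><td>A</td><td><a href="a.png">first</a></td></tr>
    <tr><td>2</td><td>B</td><td><a href="b.png">second</a></td></tr>
  </table>
</div>
"""

SELECTOR = "#grid table tr td:nth-of-type(3) a"

for parser in ("html.parser", "lxml"):
    try:
        rows = count_selector_prefixes(SAMPLE_HTML, SELECTOR, parser)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        print("{}: parser not installed, skipping".format(parser))
        continue
    print(parser)
    for prefix, count in rows:
        print("  {:>3}  {}".format(count, prefix))
```

On well-formed markup both parsers should report the same counts at every step. Running the same loop on the real page text would show the first prefix at which html.parser and lxml start to disagree.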

When I run the URL through the W3 Markup Validation Service, this is the error it returns:

Sorry, I am unable to validate this document because on line 1212 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication. The error was: utf8 "\xA0" does not map to Unicode

Most likely html.parser chokes on this as well, while lxml is more fault-tolerant.
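To make the validator message concrete: a lone 0xA0 byte is not valid UTF-8, but it is a non-breaking space in single-byte encodings such as cp1252. The snippet below is a small illustration using made-up bytes, not the actual content of line 1212 of the page.

```python
# Hypothetical bytes containing a bare 0xA0, as a cp1252-encoded page might.
raw = b"Company Law\xa0amendment"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # 0xA0 on its own cannot start or continue a valid UTF-8 sequence here.
    utf8_ok = False
    print("utf-8 decoding failed:", exc.reason)

# In cp1252 the same byte maps to U+00A0 (non-breaking space).
text = raw.decode("cp1252")
print(repr(text))
```

If the encoding does turn out to matter, one thing worth trying is passing r.content (the raw bytes) to BeautifulSoup instead of r.text, optionally with the from_encoding argument, so that Beautiful Soup does its own decoding. Whether that changes what html.parser finds on this particular page has not been tested here.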

python

web-scraping

python-3.x

python-3.5

bs4