Beautiful Soup is not selecting any element
This is the code I am using to loop through all the elements:
soup_top = bs4.BeautifulSoup(r_top.text, 'html.parser')
selector = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'
for link in soup_top.select(selector):
    print(link)
The same selector, when used in JavaScript, returns a list of length 57:
document.querySelectorAll("#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a").length;
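I thought I might not be fetching the page content correctly, so I saved a local copy of the page, but the selector still does not select anything in Beautiful Soup. What is going on?
This is the website I am using the code on.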
This seems to be caused by the parser you are using (i.e. html.parser). If I try the same thing with lxml as the parser:
from bs4 import BeautifulSoup
import requests
url = 'http://www.swapnilpatni.com/law_charts_final.php'
r = requests.get(url)
r.raise_for_status()
soup = BeautifulSoup(r.text, 'lxml')
css_select = '#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a'
links = soup.select(css_select)
print('{} link(s) found'.format(len(links)))
>> 1 link(s) found
for link in links:
    print(link['href'])
>> spadmin/doc/Company Law amendment 1.1.png
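To check whether the parser really is the variable, it can help to run the same selector against the same markup under each parser that happens to be installed. The following is only a sketch: `count_matches` and `sample_html` are hypothetical names, and the sample markup is a clean stand-in, not the real page, so it will not reproduce the broken behaviour by itself.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_matches(html, selector, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: match_count}, skipping parsers that are not installed."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is missing.
            continue
        counts[name] = len(soup.select(selector))
    return counts


# Hypothetical stand-in markup; substitute r.text from the real page.
sample_html = """
<div id="ContentPlaceHolder1_gvDisplay">
  <table>
    <tr><td>1</td><td>Chart A</td><td><a href="a.png">view</a></td></tr>
    <tr><td>2</td><td>Chart B</td><td><a href="b.png">view</a></td></tr>
  </table>
</div>
"""
selector = "#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a"

for parser, n in count_matches(sample_html, selector).items():
    print("{}: {} link(s) found".format(parser, n))
```

If the counts differ between parsers on the real page, the markup is probably being repaired differently by each one, which points at the HTML rather than at the selector.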
html.parser returns results only as far as #ContentPlaceHolder1_gvDisplay table tr, and even then it returns only the first tr.
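One way to see how far a selector gets is to evaluate it one step at a time and count the matches for each growing prefix. This is a minimal sketch, assuming the selector uses only space-separated descendant steps; `matches_per_prefix` is a hypothetical helper and the markup is an invented fragment in which the third cell is missing.

```python
from bs4 import BeautifulSoup


def matches_per_prefix(soup, selector):
    """Count matches for each growing prefix of a descendant-only selector."""
    # Assumption: steps are separated by spaces only (no '>', '+', or '~').
    parts = selector.split()
    results = []
    for i in range(1, len(parts) + 1):
        prefix = " ".join(parts[:i])
        results.append((prefix, len(soup.select(prefix))))
    return results


# Hypothetical fragment: the row has only two cells, so the third-cell step fails.
truncated_html = """
<div id="ContentPlaceHolder1_gvDisplay"><table>
<tr><td>1</td><td>Chart A</td></tr>
</table></div>
"""
soup = BeautifulSoup(truncated_html, "html.parser")
selector = "#ContentPlaceHolder1_gvDisplay table tr td:nth-of-type(3) a"

for prefix, n in matches_per_prefix(soup, selector):
    print("{:>3}  {}".format(n, prefix))
```

The first prefix whose count drops to zero (or far below what the browser reports) marks the point where the parsed tree stops matching the page as rendered.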
When the URL is run through the W3 Markup Validation Service, this is the error returned:
Sorry, I am unable to validate this document because on line 1212 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
The error was: utf8 "\xA0" does not map to Unicode
Most likely html.parser is also choking on this, while lxml is more fault-tolerant.
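The validator's complaint can be reproduced with the standard library alone: a lone 0xA0 byte is not valid UTF-8, although it is a non-breaking space in Latin-1 and Windows-1252. The sketch below uses an invented byte string and a hypothetical helper, `first_invalid_utf8`. The page's true encoding is not stated anywhere above, so decoding as cp1252 is an assumption rather than a confirmed fix.

```python
def first_invalid_utf8(data):
    """Return (offset, byte) of the first byte that is not valid UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first undecodable byte.
        return exc.start, data[exc.start:exc.start + 1]
    return None


# Hypothetical fragment containing a lone 0xA0 byte, as in the validator report.
raw = b"<td>Company Law\xa0amendment</td>"

print(first_invalid_utf8(raw))

# Assumption: the bytes are really Windows-1252, where 0xA0 is U+00A0 (NBSP).
print(repr(raw.decode("cp1252")))
```

With requests, the equivalent check would run on r.content (the raw bytes) rather than r.text, since r.text has already been decoded using whatever encoding requests guessed.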