BeautifulSoup returns an empty list when using find_all("span", text = re.compile("T"))
The HTML file can be downloaded from here.
soup = BeautifulSoup(open(r"test.html"),from_encoding="ascii")
In [43]:soup.find_all("span")
Out[43]:
[<span style="position:absolute; border: gray 1px solid; left:0px; top:50px; width:648px; height:783px;"></span>,
<span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">S
<br/></span>,
<span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">T
<br/></span>,
<span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:8px">N
<br/></span>,
<span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">E
<br/></span>,
<span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:7px">T
<br/></span>,
<span style="font-family: LJOGFN+HelveticaNeueLTStd-Bd; font-size:8px">N
<br/></span>]
In [44]:soup.find_all("span", text = re.compile("T"))
Out[44]:[]
Why does it return an empty list? Is this related to the encoding?
Update: the following code works:
In [87]:
def aa(tag):
    return tag.name == "span" and re.match("T", tag.text)
In [88]:soup.find_all(aa)[0]
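How does it work?
According to the documentation (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-text-argument), your code should work. You should file a bug report.
Edit: It looks like the problem is caused by the <br> tag inside the <span> element. This is definitely a bug.
To work around it, use a lambda so that you don't need to define a function: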
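A possible explanation for the <br> effect, which the answer above does not spell out: when a tag name is combined with a text filter, Beautiful Soup compares the pattern against each tag's .string, and .string is None whenever a tag has more than one child. A span holding "T" plus a <br/> has two children, so it never matches. The sketch below uses made-up markup rather than the original test.html, the built-in html.parser, and the keyword string=, which is the newer name for text= (assuming bs4 4.4 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, not the original test.html.
html = "<div><span>T<br/></span><span>T</span><span>N<br/></span></div>"
soup = BeautifulSoup(html, "html.parser")

with_br, plain, other = soup.find_all("span")

# Two children (the text "T" and the <br/> tag), so .string is None.
string_with_br = with_br.string

# Exactly one child (the text "T"), so .string is that text.
string_plain = plain.string

# get_text() concatenates every descendant string, so <br/> does not matter.
text_with_br = with_br.get_text()

# The tag-name-plus-string filter only sees spans whose .string is not None.
by_string = soup.find_all("span", string=re.compile("T"))
```

In this sketch only the span without a <br/> is found by the string filter, which mirrors the empty result in the question, where every non-empty span contains a <br/>.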
soup.find_all(lambda tag: tag.name == "span" and re.match("T", tag.text))
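For readers wondering why the function version works: find_all also accepts a callable, calls it once per tag, and keeps the tags for which it returns a truthy value. Since the callable reads tag.text (the concatenation of all descendant strings) instead of .string, the extra <br/> no longer gets in the way. Below is a minimal sketch under assumptions: the markup is invented to resemble the question's output, the helper name span_with_t is hypothetical, and html.parser is used. Note that re.match only matches at the start of the text, while a compiled pattern passed to find_all is applied as a search, so this version uses search() to stay closer to the original call.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the spans in the question.
html = (
    "<div>"
    "<span>S\n<br/></span>"
    "<span>T\n<br/></span>"
    "<span>N\n<br/></span>"
    "<span>T\n<br/></span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("T")


def span_with_t(tag):
    """Return True for <span> tags whose combined text contains 'T'."""
    # get_text() joins all descendant strings, so a <br/> child is harmless.
    return tag.name == "span" and pattern.search(tag.get_text()) is not None


matches = soup.find_all(span_with_t)

# strip=True removes the trailing newline that precedes the <br/>.
labels = [tag.get_text(strip=True) for tag in matches]
```

If the text may start with whitespace, search() or get_text(strip=True) is the safer choice; re.match("T", tag.text) would miss a span whose text begins with a newline.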