Python BeautifulSoup: loop over a tag (<td><b>) and get all its siblings (a href)
I have the following HTML file that I am traversing with Python's BeautifulSoup:
<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish) Jan</b>
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a> 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a> 
<td><b>1940 (English) Jan</b>
<a href="./1940/jan/2/home.htm" target="_parent">2</a> 
<a href="./1940/jan/4/home.htm" target="_parent">4</a> 
<tr><td><b>1940 (Spanish) Feb</b>
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a> 
...OMITTED...
<td><b>1940 (English) Indices</b>
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a> 
</table>
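Some of the td tags in this HTML are closed and some are not, but I assume that does not matter. What I want is the text of each link together with the corresponding bold text, like this: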
1940 (Spanish) Jan|2
1940 (Spanish) Jan|4
1940 (English) Jan|2
1940 (English) Jan|4
...
1940 (English) Indices|Jan to Mar
I can already iterate over the bold elements inside the tds with my code; what I am trying to figure out is how to iterate over the link text. My current Python code is:
import requests
from bs4 import BeautifulSoup

url = "http://nlpdl.nlp.gov.ph/OG01/1902"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# Every <b> that follows the first <td>
elements = soup.find("td").find_all_next("b")
for el in elements:
    print(el)
Thanks in advance!
This should help you:
from bs4 import BeautifulSoup
html = """
<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish) Jan</b>
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a> 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a> 
<td><b>1940 (English) Jan</b>
<a href="./1940/jan/2/home.htm" target="_parent">2</a> 
<a href="./1940/jan/4/home.htm" target="_parent">4</a> 
<tr><td><b>1940 (Spanish) Feb</b>
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a> 
<td><b>1940 (English) Indices</b>
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a> 
</table>
"""
# The 'html5lib' parser requires the html5lib package to be installed
soup = BeautifulSoup(html, 'html5lib')
table = soup.find('table')
a_tags = table.find_all('a')
for a in a_tags:
    print(a.text)
Output:
2
4
2
4
1
Jan to Mar
Here is the full version (fetching the HTML with requests and printing in the requested format):
from bs4 import BeautifulSoup
import requests

url = "http://nlpdl.nlp.gov.ph/OG01/1902"
page = requests.get(url).text
soup = BeautifulSoup(page, 'html5lib')
table = soup.find('table')
a_tags = table.find_all('a')
elements = soup.find("td").find_all_next("b")
# Pair the n-th <b> with the n-th <a>
for x in range(len(elements)):
    print(f"{elements[x].text}|{a_tags[x].text}")
Output:
1902 (Spanish) Sep|10
1902 (Spanish) Oct|17
1902 (Spanish) Nov|24
1902 (Spanish) Dec|1
1902 (Spanish) Indices|8
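Since the question asks for the siblings of each <b>, here is a minimal sketch of that route. It assumes the links remain siblings of the <b> inside the same cell after parsing, which holds for the sample from the question (shortened here) under html.parser; the name heading_link_pairs is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample from the question
# (assumption: the real page has the same shape)
SAMPLE = """<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish) Jan</b>
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>
<td><b>1940 (English) Jan</b>
<a href="./1940/jan/2/home.htm" target="_parent">2</a>
<a href="./1940/jan/4/home.htm" target="_parent">4</a>
</table>"""


def heading_link_pairs(html):
    """Hypothetical helper: (bold text, link text) for every <a> that is a sibling of a <b>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for b in soup.find_all("b"):
        # find_next_siblings only looks at later nodes that share this <b>'s parent
        for a in b.find_next_siblings("a"):
            pairs.append((b.get_text(strip=True), a.get_text(strip=True)))
    return pairs


for heading, link in heading_link_pairs(SAMPLE):
    print(f"{heading}|{link}")
```

Each heading is paired only with the links in its own cell, so a heading with several links produces several lines.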
You can use .find_previous('b') to find the matching <b> tag:
from bs4 import BeautifulSoup
txt = '''<table align=center border='1' cellpadding="8"><tr><td><b>1940 (Spanish) Jan</b>
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a> 
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a> 
<td><b>1940 (English) Jan</b>
<a href="./1940/jan/2/home.htm" target="_parent">2</a> 
<a href="./1940/jan/4/home.htm" target="_parent">4</a> 
<tr><td><b>1940 (Spanish) Feb</b>
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a> 
...OMITTED...
<td><b>1940 (English) Indices</b>
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a> 
</table>'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('a'):
    # find_previous('b') returns the nearest <b> that precedes this link
    print(a.find_previous('b').text, a.text)
Prints:
1940 (Spanish) Jan 2
1940 (Spanish) Jan 4
1940 (English) Jan 2
1940 (English) Jan 4
1940 (Spanish) Feb 1
1940 (English) Indices Jan to Mar
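The question asked for a | between the two fields, so the same lookup can build each line with an f-string. A small sketch on a shortened copy of the sample; the variable name lines is illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample from the question
txt = '''<table><tr><td><b>1940 (Spanish) Feb</b>
<a href="./1940sp/feb/1/home.htm" target="_parent">1</a>
<td><b>1940 (English) Indices</b>
<a href="./1940/ndx1/home.htm" target="_parent">Jan to Mar</a>
</table>'''

soup = BeautifulSoup(txt, "html.parser")

# One "bold text|link text" line per link
lines = [
    f"{a.find_previous('b').get_text(strip=True)}|{a.get_text(strip=True)}"
    for a in soup.find_all("a")
]
print("\n".join(lines))
```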
Try this:
import requests
from bs4 import BeautifulSoup

url = "http://nlpdl.nlp.gov.ph/OG01/1902"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
elements = soup.find("td").find_all_next("b")
links = soup.find("table").find_all("a")
# zip pairs the headings and the links by position
for el, li in zip(elements, links):
    print('{a}|{b}'.format(a=el.text, b=li.text))
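Note that zip pairs items by position, so the n-th <b> is matched with the n-th <a>. The sketch below runs the same pairing on a shortened copy of the question's sample to show what that gives when a heading has more than one link; it is an illustration on the snippet, not output from the live page.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample from the question: 2 headings, 4 links
sample = """<table><tr><td><b>1940 (Spanish) Jan</b>
<a href="./1940sp/jan/2/home.htm" target="_parent">2</a>
<a href="./1940sp/jan/4/home.htm" target="_parent">4</a>
<td><b>1940 (English) Jan</b>
<a href="./1940/jan/2/home.htm" target="_parent">2</a>
<a href="./1940/jan/4/home.htm" target="_parent">4</a>
</table>"""

soup = BeautifulSoup(sample, "html.parser")
elements = soup.find("td").find_all_next("b")
links = soup.find("table").find_all("a")

# zip stops at the shorter sequence, so only 2 of the 4 links are used
by_position = [(el.text, li.text) for el, li in zip(elements, links)]
print(by_position)
# [('1940 (Spanish) Jan', '2'), ('1940 (English) Jan', '4')]
```

The second pair attaches the link "4" from the Spanish cell to the English heading, so positional pairing fits only pages where every heading has exactly one link.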