Scraping a series of links from <a> using BeautifulSoup (in between two other tags)
Could you help me with a Python problem based on this HTML:
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
I'm trying to get the strings (Text1, Text2, ...) as well as the href links that sit between the two h2 tags.
Grabbing the strings works fine, by the way: I jump to the h2 tag (using string="One") and then walk the siblings until the next h2 node is reached, collecting everything along the way.
import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")
education = []
edu = soup.find("h2", string="One")
for elt in edu.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        education.append(elt.text + "\n")
print("".join(education))
I haven't managed to adapt this so that it also collects the links from the <a> tags into a second list. I made some amateurish attempts along the lines of education2.append(elt2.get("href")), with little success. Any ideas?
Thanks!!
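For reference, a minimal sketch of how the loop above could also collect the hrefs, run against the sample HTML rather than the live page. The links list and the isinstance check are additions made here for illustration, not part of the original code:

from bs4 import BeautifulSoup, Tag

html = '''<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>'''

soup = BeautifulSoup(html, "lxml")
education = []
links = []  # second list for the hrefs
for elt in soup.find("h2", string="One").next_siblings:  # next_siblings is the modern spelling of nextSiblingGenerator()
    if isinstance(elt, Tag) and elt.name == "h2":
        break  # stop at the next section heading
    if isinstance(elt, Tag):
        a = elt.find("a", href=True)  # assumes at most one link per sibling <div>
        if a:
            education.append(a.get_text())
            links.append(a["href"])
print(education)  # ['Text1', 'Text2', 'Text3']
print(links)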
You could try this:
from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all('a')]
Output:
[u'Text1', u'Text2', u'Text3']
Improving on @Ajax1234's answer: this only finds the tags that have the itemprop attribute. See find_all().
from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all("a", attrs={"itemprop": "affiliation"})]
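As a quick check against the same sample, an analogous comprehension for the hrefs (final_links is added here for illustration, not part of the original snippet):

final_links = [i['href'] for i in s.find_all("a", attrs={"itemprop": "affiliation"})]
print(final_text)   # ['Text1', 'Text2', 'Text3']
print(final_links)  # the three ../../snapshot.asp?carId=... links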
You were already very close to what you want; I made a few modifications. This gives you the desired result:
from bs4 import BeautifulSoup

html = '''<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div>
<div>dummy</div>
<h2 class="sectionTitle">Two</h2>'''

soup = BeautifulSoup(html, 'lxml')
texts = []
links = []
for tag in soup.find('h2', text='One').find_next_siblings():
    if tag.name == 'h2':
        break
    a = tag.find('a', itemprop='affiliation', href=True, text=True)
    if a:
        texts.append(a.text)
        links.append(a['href'])
print(texts, links, sep='\n')
Output:
['Text1', 'Text2', 'Text3']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']
I added a dummy <div> tag with no child tags to show that the code doesn't break in other cases.
If the HTML contains no <a> tags other than the itemprop="affiliation" ones you want, you can simply use this:
texts = [x.text for x in soup.find_all('a', itemprop='affiliation', text=True)]
links = [x['href'] for x in soup.find_all('a', itemprop='affiliation', href=True)]
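If keeping each text paired with its link matters, the same find_all() call can also be done in a single pass (a sketch, not part of the original answer):

pairs = [(a.text, a['href'])
         for a in soup.find_all('a', itemprop='affiliation', href=True)]
# [('Text1', '../../snapshot.asp?carId=1230559'), ('Text2', ...), ('Text3', ...)]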
My approach is as follows:
from bs4 import BeautifulSoup
html = '''
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
'''
soup = BeautifulSoup(html, "html.parser")
# Extract the texts
result1 = [i.text.strip('\n') for i in soup.find_all('div')]
print(result1)
# Extract the HREF links
result2 = [j['href'] for j in soup.find_all('a',href=True)]
print(result2)
The list result1 holds the texts enclosed in the <div> tags, while the list result2 holds the href links from the <a> tags.
Output:
['Text1', 'Text2', 'Text3', 'Two']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']
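If the stray 'Two' entry in result1 is unwanted, one option (an assumption about the desired output, not part of the original answer) is to take the texts from the affiliation links themselves:

result1 = [a.text for a in soup.find_all('a', itemprop='affiliation')]
print(result1)  # ['Text1', 'Text2', 'Text3']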
Hope this solution solves the problem!