Scraping a series of links from <a> using BeautifulSoup (in between two other tags)

Could you please help me with a Python problem based on this HTML:

<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>

I am trying to get the strings (Text1, Text2, ...) as well as the href links that sit between the two h2 tags.

Scraping the strings already works fine, by the way: I jump to the h2 tag (using string="One") and then walk through the siblings until I reach the next h2, grabbing everything along the way.

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

education = []
edu = soup.find("h2", string="One")
for elt in edu.nextSiblingGenerator():
    if elt.name == "h2":
        break
    if hasattr(elt, "text"):
        education.append(elt.text + "\n")
print("".join(education))

I can't replicate this to also collect the links from the <a> tags in an additional list. I amateurishly tried things like education2.append(elt2.get("href")), with little success. Any ideas?

Thanks!!

You can try this:

from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all('a')]

Output:

[u'Text1', u'Text2', u'Text3']
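
Since the question also asks for the href links, the same result set can be reused; a minimal sketch, assuming the same s soup object built from l above:

final_links = [i.get('href') for i in s.find_all('a')]
# ['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']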

Improving on @Ajax1234's answer; this will only find the tags that have the itemprop attribute. See find_all().

from bs4 import BeautifulSoup as soup
l = """
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
"""
s = soup(l, 'lxml')
final_text = [i.text for i in s.find_all("a", attrs={"itemprop": "affiliation"})]
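
If the link text and the href are wanted together, they can be collected in one pass; a minimal sketch, assuming the same s soup object as above:

pairs = [(a.text, a.get('href')) for a in s.find_all("a", attrs={"itemprop": "affiliation"})]
# [('Text1', '../../snapshot.asp?carId=1230559'), ('Text2', '../../snapshot.asp?carId=1648920'), ('Text3', '../../snapshot.asp?carId=1207230')]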

You were already pretty close to what you wanted. I made a few changes.

This will give you what you want:

from bs4 import BeautifulSoup

html = '''<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div>
<div>dummy</div>
<h2 class="sectionTitle">Two</h2>'''

soup = BeautifulSoup(html, 'lxml')
texts = []
links = []
for tag in soup.find('h2', text='One').find_next_siblings():
    if tag.name == 'h2':
        break
    a = tag.find('a', itemprop='affiliation', href=True, text=True)
    if a:
        texts.append(a.text)
        links.append(a['href'])

print(texts, links, sep='\n')

Output:

['Text1', 'Text2', 'Text3']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']

I added a dummy <div> tag with no child tags to show that the code does not fail in any other case.


If the HTML has no <a> tags other than the ones with itemprop="affiliation" that you want, you can simply use this:

texts = [x.text for x in soup.find_all('a', itemprop='affiliation', text=True)]
links = [x['href'] for x in soup.find_all('a', itemprop='affiliation', href=True)]
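
If a single mapping from link text to href is handier than two parallel lists, the two calls above can be folded into one; a minimal sketch, assuming the same soup object:

text_to_link = {a.text: a['href'] for a in soup.find_all('a', itemprop='affiliation', href=True)}
# {'Text1': '../../snapshot.asp?carId=1230559', 'Text2': '../../snapshot.asp?carId=1648920', 'Text3': '../../snapshot.asp?carId=1207230'}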

My solution is as follows:

from bs4 import BeautifulSoup
html = '''
<h2 class="sectionTitle">One</h2>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1230559">Text1</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1648920">Text2</a></div>
<div><a itemprop="affiliation" href="../../snapshot.asp?carId=1207230">Text3</a></div><div>
<h2 class="sectionTitle">Two</h2>
'''
soup = BeautifulSoup(html, "html.parser")

# Extract the texts
result1 = [i.text.strip('\n') for i in soup.find_all('div')]
print(result1)

# Extract the HREF links
result2 = [j['href'] for j in soup.find_all('a',href=True)]
print(result2)

The list result1 will contain the texts enclosed in the <div> tags, while result2 will contain the href links from the <a> tags.

Output:

['Text1', 'Text2', 'Text3', 'Two']
['../../snapshot.asp?carId=1230559', '../../snapshot.asp?carId=1648920', '../../snapshot.asp?carId=1207230']
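
The stray 'Two' entry in result1 comes from the last, unclosed <div>, which html.parser leaves wrapping the second <h2>. If only the link texts are wanted, one possible variant is to keep just the <div> tags that actually contain an affiliation link; a minimal sketch, assuming the same soup object:

result1 = [d.a.text for d in soup.find_all('div') if d.find('a', itemprop='affiliation')]
# ['Text1', 'Text2', 'Text3']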

Hope this solution does the trick!