How to extract a href from a given div?

I have the following HTML code from a web page:

<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>

In addition, I have the following list of link names:

links_dict = {}
links = ["Slack","Twitter","Telegram"] 

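I want to extract the href value for each corresponding link. If there is no href (see Slack in the sample code above), it means there is no link.

The expected output is: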

"Slack" -> "None"
"Twitter" -> "https://twitter.com/abc"
"Telegram" -> "https://t.me/abc"

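I cannot reach the href simply by selecting a elements, because the page has many other div elements that contain other a elements.

I would like to use BeautifulSoup or Selenium with PhantomJS. This is what I have tried:

BeautifulSoup: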

res = requests.get("https://myurl.com")
soup = BeautifulSoup(res.content,'html.parser')
tags = soup.find_all(class_="align-center")
for tag in tags:
    print(tag.text.strip())

Selenium:

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get("https://mytest.com")

tags = driver.find_elements_by_class_name("align-center")

for tag in tags:
    tag.find_element_by_tag_name("a").click()
    url = driver.current_url
    print(url)
driver.quit()

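Continuing with your BeautifulSoup idea, you can find all the img elements inside each tag and then check whether the img's alt attribute matches the expected pattern.

If the pattern matches, get the href of the parent element.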

import re

...

links = []
tags = soup.find_all(class_="align-center")
for tag in tags:
    # For each tag, get all the images
    for img in tag.find_all('img'):
        # Ensure the img has the correct `alt` pattern
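        # (default to '' so an img without an alt attribute does not raise TypeError)
        if re.match('(Twitter|Slack|Telegram) link', img.attrs.get('alt', '')):
            # Store the href of the parent <a>; this is None when the <a> has no href.
            links.append(img.find_parent().attrs.get('href'))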
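
The loop above collects the href values in a plain list, so the link names are lost. As a minimal sketch of how the result could be mapped back onto the names from the question, the version below builds the pattern from the `links` list and fills a dictionary. The helper name `extract_links` and the inline `html` string are assumptions made for illustration; `html.parser` is used only because it needs no extra install.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""

links = ["Slack", "Twitter", "Telegram"]


def extract_links(html, names):
    """Map each name to the href of the <a> wrapping its '<name> link' image."""
    soup = BeautifulSoup(html, "html.parser")
    # Build '(Slack|Twitter|Telegram) link' from the list, escaping regex characters.
    pattern = re.compile(r"(%s) link" % "|".join(re.escape(n) for n in names))
    # Start with None everywhere so names without an href stay None.
    result = {name: None for name in names}
    for div in soup.find_all(class_="align-center"):
        for img in div.find_all("img"):
            match = pattern.match(img.get("alt", ""))
            anchor = img.find_parent("a")
            if match and anchor is not None:
                result[match.group(1)] = anchor.get("href")
    return result


links_dict = extract_links(html, links)
print(links_dict)
```

Because the dictionary is pre-filled from the list, an entry such as Slack, whose `a` element has no `href`, keeps the value `None`.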

Try the script below. It should give you the desired result.

from bs4 import BeautifulSoup

content="""
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""
soup = BeautifulSoup(content,"html5lib")
# Pair each <img> with an <a>; this assumes every <a> in the div contains exactly one <img>.
links = {
    item.get("alt").split(" ")[0]: link.get("href")
    for item, link in zip(soup.select(".align-center a img"), soup.select(".align-center a"))
}
print(links)

Output:

{'Slack': None, 'Telegram': 'https://t.me/abc', 'Twitter': 'https://twitter.com/abc'}

Or you can do the same thing in a slightly different way:

soup = BeautifulSoup(content,"html5lib")
for item in soup.select(".align-center a img"):
    title = item.get("alt").split(" ")[0]
    link = item.find_parent().get('href')
    print(title,link)

Output:

Slack None
Twitter https://twitter.com/abc
Telegram https://t.me/abc
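
Both variants above key the result on whatever alt text happens to be on the page. If it is more convenient to start from the list of names, a CSS attribute selector can look up each name directly and stay scoped to the `align-center` div, which may help with the concern about other `a` elements elsewhere on the page. This is only a sketch: the helper `href_for`, the extra `footer` div, the `example.com` URL, and the name "Discord" are hypothetical additions for illustration, and the selector assumes the names contain no double quotes.

```python
from bs4 import BeautifulSoup

# Markup from the question, preceded by one hypothetical unrelated div
# that also contains a "Twitter link" image, to show the scoping.
html = """
<div class="footer">
<a href="https://example.com/unrelated"><img src="x.png" alt="Twitter link"></a>
</div>
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
  <img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
  <img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""


def href_for(soup, name):
    """Return the href of the <a> wrapping the '<name> link' image, or None."""
    # Only match images that sit directly inside an <a> directly inside div.align-center.
    img = soup.select_one('div.align-center > a > img[alt="%s link"]' % name)
    if img is None:
        return None  # the name does not appear in the div at all
    # The child combinator guarantees the parent is the <a>; get() gives None without href.
    return img.parent.get("href")


soup = BeautifulSoup(html, "html.parser")
names = ["Slack", "Twitter", "Telegram", "Discord"]  # "Discord" is not on the page
links_dict = {name: href_for(soup, name) for name in names}
for name, href in links_dict.items():
    print(name, "->", href)
```

With this approach a name that is missing from the page and a name whose `a` has no `href` both end up as `None`; if the two cases need to be told apart, `href_for` could return a sentinel for the first one instead.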

To extract the href value that corresponds to each alt attribute in the child nodes, you can use Selenium as shown in the following code block:

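# Selenium 3 style calls, matching the question; Selenium 4 uses driver.find_elements(By.XPATH, ...) instead.
tags = driver.find_elements_by_xpath("//div[@class='align-center']/a/img")
my_alt = []
my_href = []
for tag in tags:
    # The Python bindings use get_attribute(), not the Java-style getAttribute().
    alt_text = tag.get_attribute("alt")
    my_alt.append(alt_text)
    # The enclosing <a> is the img's parent; the preceding:: axis would skip ancestors.
    my_href.append(tag.find_element_by_xpath("./parent::a").get_attribute("href"))
for alt, href in zip(my_alt, my_href):
    print(alt, href)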