How to extract an href from a given div?
I have the following HTML code from a web page:
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
In addition, I have the following list of link names:
links_dict = {}
links = ["Slack","Twitter","Telegram"]
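I want to extract the href value for each corresponding link. If there is no href (see Slack in the sample code above), it means there is no link.
The expected output is as follows: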
"Slack" -> "None"
"Twitter" -> "https://twitter.com/abc"
"Telegram" -> "https://t.me/abc"
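I cannot get the href simply by selecting a, because the page has many other div elements that contain other a elements.
I want to use BeautifulSoup, or Selenium together with PhantomJS. This is what I have tried:
BeautifulSoup: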
import requests
from bs4 import BeautifulSoup

res = requests.get("https://myurl.com")
soup = BeautifulSoup(res.content, 'html.parser')
tags = soup.find_all(class_="align-center")
for tag in tags:
    print(tag.text.strip())
Selenium:
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get("https://mytest.com")
tags = driver.find_elements_by_class_name("align-center")
for tag in tags:
    tag.find_element_by_tag_name("a").click()
    url = driver.current_url
    print(url)
driver.quit()
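Continuing your idea with BeautifulSoup, you can find all the img tags inside each matching div and check whether each image's alt attribute matches the expected pattern.
If it does, take the href of the image's parent a element.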
import re
...
links = []
tags = soup.find_all(class_="align-center")
for tag in tags:
    # For each tag, get all the images
    for img in tag.find_all('img'):
        # Ensure the img has the correct `alt` pattern
        # (default to '' so an img without alt does not raise TypeError)
        if re.match('(Twitter|Slack|Telegram) link', img.get('alt', '')):
            # Store the href of the enclosing <a> (None if it has no href).
            links.append(img.parent.get('href'))
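The loop above keeps only the href values, so the link name from the question is lost. A minimal sketch of how the name could be kept, assuming every alt text has the form "<Name> link" and every img sits directly inside its a element; the function name `extract_links` and the `html` sample string are hypothetical, introduced only for illustration:

```python
import re

from bs4 import BeautifulSoup

# Sample markup copied from the question (assumption: the real page looks the same).
html = """
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""

# The capture group holds the link name, e.g. "Slack" in "Slack link".
ALT_PATTERN = re.compile(r"(Twitter|Slack|Telegram) link")


def extract_links(markup):
    """Map each recognised link name to the href of its enclosing <a>, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for container in soup.find_all(class_="align-center"):
        for img in container.find_all("img"):
            match = ALT_PATTERN.match(img.get("alt", ""))
            if match:
                # .get returns None when the <a> has no href attribute.
                result[match.group(1)] = img.parent.get("href")
    return result


links_dict = extract_links(html)
for name, href in links_dict.items():
    print(name, "->", href)
```

Using a dictionary rather than a list means a missing href shows up as `None` under the right name instead of as an anonymous gap in the list.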
Try the script below. It should give you the desired result.
from bs4 import BeautifulSoup
content="""
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""
soup = BeautifulSoup(content,"html5lib")
links = {item.get("alt").split(" ")[0]:link.get('href') for item,link in zip(soup.select(".align-center a img"),soup.select(".align-center a"))}
print(links)
Output:
{'Slack': None, 'Telegram': 'https://t.me/abc', 'Twitter': 'https://twitter.com/abc'}
Or you can do the same thing in a slightly different way:
soup = BeautifulSoup(content,"html5lib")
for item in soup.select(".align-center a img"):
    title = item.get("alt").split(" ")[0]
    link = item.findParent().get('href')
    print(title,link)
Output:
Slack None
Twitter https://twitter.com/abc
Telegram https://t.me/abc
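The question starts from a fixed `links` list, so it may be useful to drive the lookup from that list and report `None` for any name the page does not mention at all. A rough sketch under the same assumptions as the scripts above (alt text of the form "<Name> link", one img per a); `hrefs_by_name` is a hypothetical helper, "Discord" is a made-up name added to show the missing case, and the built-in `html.parser` is used here only to avoid the extra html5lib dependency:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
content = """
<div class="align-center">
<a target="_blank" rel="nofollow" class="link-block-2 w-inline-block w-condition-invisible">
<img src="https://global.com/slack-symbol.png" alt="Slack link">
</a>
<a target="_blank" rel="nofollow" href="https://twitter.com/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/twitter.png" width="16" alt="Twitter link">
</a>
<a target="_blank" rel="nofollow" href="https://t.me/abc" class="link-block-2 w-inline-block">
<img src="https://global.com/telegram.png" alt="Telegram link">
</a>
</div>
"""


def hrefs_by_name(markup, names):
    """Return {name: href or None} for every requested name."""
    soup = BeautifulSoup(markup, "html.parser")
    found = {}
    # Only images that are direct children of an <a> and carry an alt attribute.
    for img in soup.select(".align-center a > img[alt]"):
        name = img["alt"].split(" ")[0]
        found[name] = img.parent.get("href")
    # Names absent from the page fall back to None via dict.get.
    return {name: found.get(name) for name in names}


links = ["Slack", "Twitter", "Telegram", "Discord"]
links_dict = hrefs_by_name(content, links)
for name in links:
    print(name, links_dict[name])
```

With this shape a name that has an a element without href and a name that never appears on the page both map to `None`; if the two cases need to be told apart, a sentinel value other than `None` would be required.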
To extract the href value that corresponds to each alt attribute of the child nodes, you can use Selenium, as in the following code block:
tags = driver.find_elements_by_xpath("//div[@class='align-center']/a/img")
my_alt = []
my_href = []
for tag in tags:
    # The Python bindings use get_attribute (getAttribute is the Java name).
    my_alt.append(tag.get_attribute("alt"))
    # ".." selects the enclosing <a>; get_attribute returns None if href is absent.
    my_href.append(tag.find_element_by_xpath("..").get_attribute("href"))
for alt, href in zip(my_alt, my_href):
    print(alt, href)