Extract an image link from within a div and srcset using BS4
Sample div tag in the HTML:
[<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
https://img.example.image.link.here/954839 480w,
https://img.example.image.link.here/954839 600w,
https://img.example.image.link.here/954839 800w,
https://img.example.image.link.here/954839 1080w
">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>]
Desired result (from srcset):
https://img.example.image.link.here/954839
My function:
def extract_img_link(html):
    with open(html, 'rb') as file:
        content = BeautifulSoup(file)
        for image in content.findAll('div', attrs={'class':'event-info-and-content'}):
            print(image.get("srcset"))
        return(image)

# call the function on the HTML file
html = 'data/website/events.html'
print(extract_img_link(html))
My function just returns the whole tag I'm looking for, not the specific link inside it:
[<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
https://img.example.image.link.here/954839 480w,
https://img.example.image.link.here/954839 600w,
https://img.example.image.link.here/954839 800w,
https://img.example.image.link.here/954839 1080w
">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>]
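To get the image path, change your selection and read the single path from the content attribute of <picture>:
for e in soup.select('div.event-info-and-content picture'):
    print(e.get('content'))
Or take the first URL from the srcset attribute of <source>:
for e in soup.select('div.event-info-and-content source'):
    print(e.get('srcset').split()[0])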
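If you need a specific size rather than just the first entry, the srcset value can be split into its candidates. The following is only a minimal sketch: parse_srcset is a hypothetical helper name, and it assumes every candidate uses a width descriptor such as 480w and that the URLs contain no commas or spaces. The sample markup is shortened from the question.

```python
from bs4 import BeautifulSoup


def parse_srcset(srcset):
    """Split a srcset value into (url, width) pairs (hypothetical helper)."""
    candidates = []
    for entry in srcset.split(","):
        parts = entry.split()
        if not parts:
            continue  # skip empty pieces, e.g. after a trailing comma
        url = parts[0]
        width = 0  # assumption: a missing or non-width descriptor counts as 0
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        candidates.append((url, width))
    return candidates


html = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
https://img.example.image.link.here/954839 480w,
https://img.example.image.link.here/954839 1080w
">
</picture>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
source = soup.select_one("div.event-info-and-content source")
candidates = parse_srcset(source.get("srcset", ""))

# Pick the candidate with the largest declared width.
largest_url, largest_width = max(candidates, key=lambda c: c[1])
print(largest_url, largest_width)
```

In the question's markup every candidate points to the same URL, so this only matters when the entries differ per width.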
Example
from bs4 import BeautifulSoup
html = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839">
<source sizes="720px" srcset="
https://img.example.image.link.here/954839 480w,
https://img.example.image.link.here/954839 600w,
https://img.example.image.link.here/954839 800w,
https://img.example.image.link.here/954839 1080w
">
<img alt="" class="event-info-and-content" data-automation="event-hero-image"/>
</source></picture>
</div>
'''
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('div.event-info-and-content picture'):
    print(e.get('content'))
Output
https://img.example.image.link.here/954839
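As a rough illustration of why the original attempt printed None: Tag.get() only looks at the attributes of the tag it is called on, not at those of its children. The snippet below is a small sketch using a shortened copy of the markup from the question; the variable names are my own.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="event-info-and-content">'
    '<picture content="https://img.example.image.link.here/954839"></picture>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="event-info-and-content")

# The div itself has no srcset attribute, so get() returns None.
div_srcset = div.get("srcset")

# The attribute lives on a nested tag, which has to be selected first.
picture_content = div.find("picture").get("content")

print(div_srcset)
print(picture_content)
```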
You forgot that there is an extra layer, picture, inside the div.
The following worked for me.
from bs4 import BeautifulSoup

def extract_img_link(html):
    with open(html, 'rb') as file:
        content = BeautifulSoup(file, "html.parser")
        for image in content.find_all('div', attrs={'class':'event-info-and-content'}):
            # the link is stored on the nested <picture> tag
            for picture in image.find_all('picture'):
                print(picture["content"])

# call the function on the HTML file
html = 'data/website/events.html'
extract_img_link(html)
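If you would rather get the links back than print them, a variant along these lines might work. It is a sketch, not a drop-in replacement: extract_img_links is a hypothetical name, it takes the markup as a string instead of a file path, and it skips any <picture> tag that has no content attribute.

```python
from bs4 import BeautifulSoup


def extract_img_links(markup):
    """Return the content attribute of each <picture> inside the target div."""
    soup = BeautifulSoup(markup, "html.parser")
    links = []
    for picture in soup.select("div.event-info-and-content picture"):
        link = picture.get("content")
        if link:  # skip <picture> tags without a content attribute
            links.append(link)
    return links


sample = '''
<div class="event-info-and-content">
<picture content="https://img.example.image.link.here/954839"></picture>
<picture></picture>
</div>
'''

links = extract_img_links(sample)
print(links)
```

For a file on disk, you could read it first and pass the text in, for example with open('data/website/events.html', 'rb') as in the function above.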