I want to get the image link inside an RSS feed description tag

Question

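I want to get the image link inside the description tag of an RSS feed.

Using feedparser, I can get the value of the description tag. But I want to get the image link inside that tag.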

<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" /></div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian, P. Piyasena has been sentenced to 4 years in prison and fined Rs.</p>
</div><div class="K2FeedFullText">
<p>5.4 million for using state-owned vehicle for an year after losing his parliamentary seat.</p></div>]]></description>

Then I tried extracting it as a substring in Python, like this:

import re

text =  "<![CDATA[<img src='https://adaderanaenglish.s3.amazonaws.com/' width='60' align='left' hspace='5'/>Former Tamil National Alliance (TNA) MP P. Piyasena had been sentenced to 4 years in prison over a case of misusing a state vehicle after losing his MP post. MORE..]]>"

match = re.search("<img src=\"(.+?) \"", text, flags=re.IGNORECASE)

try:
    result = match.group(1)
except:
    result = "no match found"

print(result)

C:/Users/ASUS/Desktop/untitled/a.py

no match found

Process finished with exit code 0

Answer 1

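You need to change the regular expression slightly to make it work. Your pattern expects double quotes and a trailing space before the closing quote, but the string uses single quotes. What you want is to capture the content immediately after src=' and stop as soon as the next ' character is reached (a lazy match). So your regular expression should be: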

match = re.search(r"src='(.*?)'", text)

You can visit this for help with regular expressions.
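As a minimal sketch of the same lazy-match idea, the pattern below accepts either quote style, since the feed sample in the question uses double quotes while the test string uses single quotes. The helper name `first_img_src` and the constant `IMG_SRC_RE` are hypothetical, chosen only for illustration.

```python
import re

# Find src= inside an <img> tag, accept a single or double opening quote,
# then lazily capture everything up to the same closing quote (\1).
IMG_SRC_RE = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def first_img_src(text):
    """Return the src of the first <img> tag in text, or None if there is none."""
    match = IMG_SRC_RE.search(text)
    return match.group(2) if match else None


# Sample strings shortened from the question.
single_quoted = (
    "<![CDATA[<img src='https://adaderanaenglish.s3.amazonaws.com/' "
    "width='60' align='left' hspace='5'/>Former Tamil National Alliance (TNA) MP]]>"
)
double_quoted = (
    '<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/'
    'cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena" /></div>'
)

print(first_img_src(single_quoted))
print(first_img_src(double_quoted))
print(first_img_src("no image here"))
```

Checking `match` before calling `group` avoids the bare `try`/`except` used in the question. A regular expression like this is still only a rough fit for HTML and may miss unusual markup.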

Answer 2

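You can also use split. This depends entirely on whether you have already isolated the correct tag, as you mention in the question, so that you are working with text.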

text = '''
<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" /></div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian, P. Piyasena has been sentenced to 4 years in prison and fined Rs.</p>
</div><div class="K2FeedFullText">
<p>5.4 million for using state-owned vehicle for an year after losing his parliamentary seat.</p></div>]]></description>
'''

link = text.split('src="')[1].split('"')[0] 
print(link)
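The split approach raises an IndexError when the text contains no src=" at all. A small sketch of a more defensive variant, using `str.partition` instead of `split`, is shown below. The function name `src_by_partition` and its parameters are hypothetical, and it still assumes the attribute is written with double quotes.

```python
def src_by_partition(text, marker='src="', closing='"'):
    """Return the text between the first marker and the next closing quote.

    Returns None when the marker or the closing quote is missing.
    """
    _, found, rest = text.partition(marker)
    if not found:
        return None  # no src=" in the text
    value, closed, _ = rest.partition(closing)
    return value if closed else None


# Sample shortened from the description in the question.
sample = (
    '<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/'
    'media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" '
    'alt="MP Piyasena sentenced to 4 years in prison" /></div>]]></description>'
)

print(src_by_partition(sample))
print(src_by_partition("<p>no image in this description</p>"))
```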

Answer 3

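You can get the image link without a regular expression. Try the code below. It first finds the description tag and takes its next_element (the CDATA content), then parses that content into a second soup to get the image link.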

from bs4 import BeautifulSoup

data='''<description><![CDATA[<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" alt="MP Piyasena sentenced to 4 years in prison" /></div><div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) parliamentarian, P. Piyasena has been sentenced to 4 years in prison and fined Rs.</p>
</div><div class="K2FeedFullText">
<p>5.4 million for using state-owned vehicle for an year after losing his parliamentary seat.</p></div>]]></description>'''

soup=BeautifulSoup(data,'html.parser')
item=soup.find('description')
data1=item.next_element
soup1=BeautifulSoup(data1,'html.parser')
print(soup1.find('img')['src'])

Output:

https://srilankamirror.com/media/k2/items/cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg
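If installing BeautifulSoup is not an option, a similar result can probably be had with the standard library's `html.parser`. This is a swapped-in technique, not what the answer above uses. The sketch assumes the description value has already been extracted (for example by feedparser), so the CDATA wrapper is gone and only the inner HTML remains. The class name `ImgSrcCollector` is hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        # Self-closing tags such as <img ... /> also reach this method.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Assumed input: the inner HTML of the description, without the CDATA wrapper.
description_html = (
    '<div class="K2FeedImage"><img src="https://srilankamirror.com/media/k2/items/'
    'cache/25a3bb259efa21fc96901ad625f3a85d_S.jpg" '
    'alt="MP Piyasena sentenced to 4 years in prison" /></div>'
    '<div class="K2FeedIntroText"><p>Former Tamil National Alliance (TNA) '
    "parliamentarian</p></div>"
)

collector = ImgSrcCollector()
collector.feed(description_html)
collector.close()

# Take the first image if there is one.
print(collector.sources[0] if collector.sources else "no image found")
```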


html

python

substring

beautifulsoup

rss-reader