Accessing duplicate feed tags using feedparser
I'm trying to parse this feed: https://feeds.podcastmirror.com/dudesanddadspodcast
The channel section has two podcast:person entries:
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" >Joel DeMott</podcast:person>
When it parses, feedparser only brings in one of the names:
>>> import feedparser
>>> d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
>>> d.feed['podcast_person']
{'role': 'host', 'img': 'https://dudesanddadspodcast.com/files/2019/03/joel.jpg', 'href': 'https://www.podchaser.com/creators/joel-demott-107aRuVQLH'}
What would I change so that it shows a list of every podcast_person instead, so I can loop over each one?
Idea #1:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
soup.find_all("podcast:person")
Output:
[<podcast:person href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" role="host">Andy Lehman</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" role="host">Joel DeMott</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/cory-martin-107aRwmCuu" img="" role="guest">Cory Martin</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/julie-lehman-107aRuVQPL" img="" role="guest">Julie Lehman</podcast:person>]
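The tags returned by find_all can then be looped over to pull out each person's name and attributes. A minimal sketch, run here on an inline fragment of the feed's channel for illustration (the real code would pass the downloaded content to BeautifulSoup as above):

```python
from bs4 import BeautifulSoup

# Inline sample mirroring the feed's channel section.
xml = '''
<channel>
  <podcast:person role="host" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
  <podcast:person role="host" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH">Joel DeMott</podcast:person>
</channel>
'''
soup = BeautifulSoup(xml, 'html.parser')
for person in soup.find_all("podcast:person"):
    # .get_text() gives the element text, .get() reads an attribute.
    print(person.get_text(), '-', person.get('role'))
```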
Idea #2:
import feedparser
d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
hosts = d.entries[1]['authors'][1]['name'].split(", ")
print("The hosts of this Podcast are {} and {}.".format(hosts[0], hosts[1]))
Output:
The hosts of this Podcast are Joel DeMott and Andy Lehman.
You can loop over feed['items'] and get all the records:
import feedparser
feed = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
if feed:
    for item in feed['items']:
        print(f'{item["title"]} - {item["author"]}')
Since I'm familiar with lxml, and someone had already posted a solution using feedparser, I wanted to test how lxml could be used to parse an RSS feed. In my opinion, the daunting part is handling the RSS namespaces, but once that is solved the task becomes easy:
import urllib.request
from lxml import etree
feed = etree.parse(urllib.request.urlopen('https://feeds.podcastmirror.com/dudesanddadspodcast')).getroot()
namespaces = {
    'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'
}
for episode in feed.iter('item'):
    # print(etree.tostring(episode))
    authors = episode.xpath('itunes:author/text()', namespaces=namespaces)
    print(authors)
    # title = episode.xpath('itunes:title/text()', namespaces=namespaces)
    # episode_metadata = '{} - {}'.format(title[0] if title else 'Missing title', authors[0] if authors else 'Missing authors')
    # print(episode_metadata)
Compared with a similar solution using feedparser, the code above executed almost 3× faster, reflecting the performance gain of using lxml as the parsing library.
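The same namespace approach also answers the original question: register the podcast namespace and XPath returns every podcast:person element, not just the last one. A minimal sketch on an inline fragment; the namespace URI is assumed here to be the Podcasting 2.0 one (https://podcastindex.org/namespace/1.0), so check the feed's own xmlns:podcast declaration if it differs:

```python
from lxml import etree

# Inline fragment mirroring the feed; the real code would parse the URL as above.
xml = b'''<rss xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <podcast:person role="host" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
    <podcast:person role="host" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH">Joel DeMott</podcast:person>
  </channel>
</rss>'''
root = etree.fromstring(xml)
namespaces = {'podcast': 'https://podcastindex.org/namespace/1.0'}
for person in root.xpath('channel/podcast:person', namespaces=namespaces):
    # .text is the element text, .get() reads an attribute.
    print(person.text, person.get('role'), person.get('href'))
```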
Rather than feedparser, I prefer BeautifulSoup. You can copy the code below to test the end result:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
feeds = soup.find_all("podcast:person")
print(type(feeds))  # bs4.element.ResultSet (a list subclass)
# You can loop over the `feeds` variable.