Import RSS with FeedParser and Get Both Posts and General Information to Single Pandas DataFrame
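As a Python newbie, I am practicing importing data in Python. Ultimately, I want to analyze data from different podcasts (information about the podcast itself and about each episode) by putting it into one coherent DataFrame and processing it with NLP.
So far I have managed to read a list of RSS feeds and get the information for each episode (post) of each feed.
But I am struggling to find an integrated workflow in Python that collects both
- the information about each episode of an RSS feed (the posts)
- and the general information about the RSS feed (such as the title of the podcast)
in one go.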
Code
This is what I have so far:
import feedparser
import pandas as pd

rss_feeds = ['http://feeds.feedburner.com/TEDTalks_audio',
             'https://joelhooks.com/rss.xml',
             'https://www.sciencemag.org/rss/podcast.xml',
             ]
# number of feeds is reduced for testing

posts = []
feed = []
for url in rss_feeds:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
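Output
The DataFrame contains 652 non-null objects in three columns (as expected), basically every post from every podcast. The title column holds the title of each episode, but not the title of the podcast (in this case 'TED Talks Daily').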
   title                                              link                                               summary
0  3 questions to ask yourself about everything y...  https://www.ted.com/talks/stacey_abrams_3_ques...  How you respond to setbacks is what defines yo...
1  What your sleep patterns say about your relati...  https://www.ted.com/talks/tedx_shorts_what_you...  Wendy Troxel looks at the cultural expectation...
2  How we can actually pay people enough -- with ...  https://www.ted.com/talks/ted_business_how_we_...  Capitalism urgently needs an upgrade, says Pay...
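I am also struggling to find a way to include the title of the podcast in this DataFrame. I keep getting errors when I try to select parts of the feed-level information, e.g. ['feed']['title'].
I appreciate every hint!
Sources
What I have so far is based on this source: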
In this case you can use feed.feed.title to access the feed title:
# ...
for url in rss_feeds:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((feed.feed.title, post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['feed_title', 'title', 'link', 'summary'])
df
Output:
feed_title title link summary
0 TED Talks Daily 3 ways compa... https://www.... When we expe...
1 TED Talks Daily How we could... https://www.... Concrete is ...
2 TED Talks Daily 3 questions ... https://www.... How you resp...
3 TED Talks Daily What your sl... https://www.... Wendy Troxel...
4 TED Talks Daily How we can a... https://www.... Capitalism u...
.. ... ... ... ...
649 Science Maga... Science Podc... https://traf... Fear-enhance...
650 Science Maga... Science Podc... https://traf... Discussing t...
651 Science Maga... Science Podc... https://traf... Talking kids...
652 Science Maga... Science Podc... https://traf... The minimum ...
653 Science Maga... Science Podc... https://traf... The origin o...
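
If some feeds or episodes lack a field, attribute access such as post.summary can raise an AttributeError. A minimal sketch of a more defensive variant follows, assuming the parsed result behaves like a mapping with a 'feed' mapping and an 'entries' list (as the object returned by feedparser.parse does). The helper name feed_to_rows, the extra feed_link column, and the sample data are made up for illustration, and the sample stands in for a real feedparser.parse(url) call so the sketch runs offline.

```python
import pandas as pd

COLUMNS = ["feed_title", "feed_link", "title", "link", "summary"]


def feed_to_rows(parsed):
    """Flatten one parsed feed into a list of row dicts, one per entry.

    `parsed` is assumed to look like the result of feedparser.parse(url):
    a mapping with a 'feed' mapping and an 'entries' list.
    """
    feed_info = parsed.get("feed", {})
    rows = []
    for post in parsed.get("entries", []):
        rows.append({
            # Feed-level information, repeated on every episode row.
            "feed_title": feed_info.get("title"),
            "feed_link": feed_info.get("link"),
            # Episode-level information; .get() yields None if a field is missing.
            "title": post.get("title"),
            "link": post.get("link"),
            "summary": post.get("summary"),
        })
    return rows


# Hypothetical stand-in for feedparser.parse(url); the data is invented.
sample = {
    "feed": {"title": "Example Podcast", "link": "https://example.com"},
    "entries": [
        {"title": "Episode 1", "link": "https://example.com/1", "summary": "First."},
        {"title": "Episode 2", "link": "https://example.com/2"},  # no summary
    ],
}

rows = feed_to_rows(sample)
df = pd.DataFrame(rows, columns=COLUMNS)
```

With real feeds, the same helper could be called once per URL, for example rows.extend(feed_to_rows(feedparser.parse(url))) inside the loop over rss_feeds, before building the DataFrame from the combined list. Missing fields then show up as empty values in the DataFrame instead of stopping the import.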