Get Feeds from FeedParser and Import to Pandas DataFrame

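I'm learning Python. As an exercise, I'm building an RSS scraper with feedparser, putting the output into a pandas DataFrame and trying to mine it with NLTK... but first I'm getting a list of articles from multiple RSS feeds.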

I used this post and combined it with an answer I got previously to another question on how to get it into a DataFrame.

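Here is the problem: I want to be able to see the data from all the feeds in the DataFrame. At the moment I can only access the first item in the list of feeds.

FeedParser seems to be doing its job, but when I put the result into a pandas DataFrame, it seems to grab only the first RSS feed in the list.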

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = []
for url in rawrss:
    feeds.append(feedparser.parse(url))

for feed in feeds:
    for post in feed.entries:
        print(post.title, post.link, post.summary)

df = pd.DataFrame(columns=['title', 'link', 'summary'])

for i, post in enumerate(feed.entries):
    df.loc[i] =  post.title, post.link, post.summary

df.shape

df

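Your code loops over every post and prints its data. The part of the code that adds post data to the DataFrame is not inside that loop (indentation is meaningful in Python!), so you only see data from one feed in the DataFrame.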
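
To make the scoping issue concrete, here is a small sketch that needs no network access. The nested lists are made-up stand-ins for parsed feeds, not real feedparser output. It shows that once a `for` loop has finished, the loop variable stays bound to the last item, so code placed after the loop only ever sees that one feed.

```python
# Stand-in data: each "feed" is just a list of post titles (hypothetical values).
feeds = [["bbc-1", "bbc-2"], ["yahoo-1"], ["tc-1", "tc-2", "tc-3"]]

for feed in feeds:
    pass  # the printing loop from the question would run here

# The loop is over, but `feed` is still bound to the last element of `feeds`.
rows_outside = list(feed)

# Collecting inside the loop body visits every feed instead.
rows_inside = []
for feed in feeds:
    for title in feed:
        rows_inside.append(title)

print(rows_outside)
print(rows_inside)
```

In this sketch `rows_outside` holds only the three titles of the last stand-in feed, while `rows_inside` holds all six.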

You can build up a list of posts while looping over the feeds, and then create the DataFrame once at the end:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = [] # list of feed objects
for url in rawrss:
    feeds.append(feedparser.parse(url))

posts = [] # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary']) # pass data to init

You can tidy this up a little by combining the two for loops:

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
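
One thing to watch with either version: not every feed supplies every field, and `post.summary` raises `AttributeError` when an entry has no summary. Below is a minimal sketch of a more tolerant collector. `collect_posts` is a hypothetical helper name, and the stand-in feeds built from `SimpleNamespace` and plain dicts only imitate the shape of what `feedparser.parse` returns; feedparser entries are dict-like, so `.get` should work on them in the same way.

```python
from types import SimpleNamespace


def collect_posts(feeds):
    """Return (title, link, summary) tuples, using '' for any missing field."""
    posts = []
    for feed in feeds:
        for post in feed.entries:
            posts.append((
                post.get('title', ''),
                post.get('link', ''),
                post.get('summary', ''),
            ))
    return posts


# Stand-in feeds (hypothetical data); the second entry has no summary.
fake_feeds = [
    SimpleNamespace(entries=[
        {'title': 'A', 'link': 'http://example.com/a', 'summary': 'first'},
    ]),
    SimpleNamespace(entries=[
        {'title': 'B', 'link': 'http://example.com/b'},
    ]),
]

posts = collect_posts(fake_feeds)
print(posts)
```

The resulting list of tuples can be passed to `pd.DataFrame(posts, columns=['title', 'link', 'summary'])` exactly as above.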

I build the DataFrame from dicts instead:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

rows = []  # one dict per entry; DataFrame.append was removed in pandas 2.0

for url in rawrss:
    dp = feedparser.parse(url)

    for i, e in enumerate(dp.entries):
        # fall back to a numbered placeholder when an entry lacks a field
        one_feed = {}
        one_feed['etitle'] = e.title if 'title' in e else f'title {i}'
        one_feed['summary'] = e.summary if 'summary' in e else f'no summary {i}'
        one_feed['elink'] = e.link if 'link' in e else f'link {i}'
        one_feed['published'] = e.published if 'published' in e else f'no published {i}'
        one_feed['elink_img'] = e.links[1].href if 'links' in e and len(e.links) > 1 else f'no link_img {i}'

        rows.append(one_feed)

df = pd.DataFrame(rows)  # build the frame once, from the list of dicts

Adding columns is easier this way.
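
As a rough illustration of why dicts make new columns cheap, the sketch below drives the row building from a single mapping, so adding a column means adding one line to it. The names `FIELDS` and `entry_to_row`, the `author` column, and the sample entries are all assumptions made up for this example; plain dicts stand in for feedparser entries.

```python
# Column name -> key to read from each entry.
# Adding a column means adding one line to this mapping.
FIELDS = {
    'etitle': 'title',
    'summary': 'summary',
    'elink': 'link',
    'published': 'published',
    'author': 'author',  # newly added column (hypothetical)
}


def entry_to_row(entry, i):
    """Build one row dict, with a numbered placeholder for any missing field."""
    return {col: entry.get(key, f'no {key} {i}') for col, key in FIELDS.items()}


# Stand-in entries (hypothetical data), each missing some fields.
entries = [
    {'title': 'A', 'link': 'http://example.com/a', 'author': 'Ann'},
    {'title': 'B', 'summary': 'text', 'published': 'Mon, 01 Jan 2024'},
]

rows = [entry_to_row(e, i) for i, e in enumerate(entries)]
for row in rows:
    print(row)
```

A list of dicts like `rows` can be handed straight to `pd.DataFrame(rows)`, which uses the dict keys as column names.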