Get Feeds from FeedParser and Import to Pandas DataFrame
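I'm learning Python. As an exercise, I'm building an RSS scraper with feedparser, putting the output into a pandas DataFrame, and then trying to mine it with NLTK... but first I need to get a list of articles from multiple RSS feeds.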
I started from an earlier post and combined it with an answer I got previously to another question on how to get the data into a DataFrame.
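The problem: I want to see the data from all the feeds in the DataFrame, but at the moment I can only access the first item in the list of feeds.
FeedParser seems to be doing its job, but when the data goes into the pandas df it appears to pick up only the first RSS feed in the list.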
import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
]

feeds = []
for url in rawrss:
    feeds.append(feedparser.parse(url))

for feed in feeds:
    for post in feed.entries:
        print(post.title, post.link, post.summary)

df = pd.DataFrame(columns=['title', 'link', 'summary'])

for i, post in enumerate(feed.entries):
    df.loc[i] = post.title, post.link, post.summary

df.shape

df
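Your code loops over every post and prints its data. The part that adds post data to the DataFrame is not inside that loop (indentation is significant in Python!), so you only see data from one feed in the DataFrame.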
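The scoping issue can be reproduced without any network access. The sketch below uses plain lists as hypothetical stand-ins for feed.entries (the names fake_feeds, rows_outside, and rows_inside are illustrative and do not come from the question). It suggests that a loop placed after the nested loops only sees the feed that the loop variable was last bound to:

```python
# Stand-in data: each inner list plays the role of feed.entries (hypothetical titles).
fake_feeds = [
    ["bbc-1", "bbc-2"],
    ["yahoo-1"],
    ["tc-1", "tc-2", "tc-3"],
]

for feed in fake_feeds:
    for post in feed:
        pass  # the question's code prints each post here

# This comprehension sits outside the `for feed` loop, so it runs once,
# using whatever `feed` was bound to on the final iteration.
rows_outside = [post for post in feed]

# Collecting inside the outer loop gathers the posts of every feed.
rows_inside = []
for feed in fake_feeds:
    for post in feed:
        rows_inside.append(post)

print(rows_outside)
print(rows_inside)
```

In this sketch rows_outside holds only the entries of the final list, while rows_inside holds all six.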
You can build a list of posts while iterating over the feeds, then create the DataFrame at the end:
import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
]

feeds = []  # list of feed objects
for url in rawrss:
    feeds.append(feedparser.parse(url))

posts = []  # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])  # pass data to init
You can tidy this up slightly by combining the two for loops:
posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
I am using a dict to build the DataFrame:
import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
]

rows = []  # one dict per entry
for url in rawrss:
    dp = feedparser.parse(url)
    for i, e in enumerate(dp.entries):
        one_feed = {}
        # fall back to a placeholder when an entry lacks a field
        one_feed['etitle'] = e.title if 'title' in e else f'title {i}'
        one_feed['summary'] = e.summary if 'summary' in e else f'no summary {i}'
        one_feed['elink'] = e.link if 'link' in e else f'link {i}'
        one_feed['published'] = e.published if 'published' in e else f'no published {i}'
        one_feed['elink_img'] = e.links[1].href if 'links' in e and len(e.links) > 1 else f'no link_img {i}'
        rows.append(one_feed)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list of dicts
df = pd.DataFrame(rows)
Adding columns is easier this way.
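One way to make adding a column a one-line change is to describe the columns in a small table and build each row from it. This is only a sketch: the plain dicts below are hypothetical stand-ins for feedparser entries, COLUMNS and entry_to_row are illustrative names, and it assumes that real entries support dict-style .get in the same way they support the 'title' in e checks above.

```python
# Plain dicts stand in for feedparser entries (hypothetical sample data).
sample_entries = [
    {"title": "First post", "link": "http://example.com/1", "summary": "Intro"},
    {"title": "Second post", "link": "http://example.com/2"},  # no summary
]

# Hypothetical column spec: output column name -> (entry key, default value).
COLUMNS = {
    "etitle": ("title", "no title"),
    "elink": ("link", "no link"),
    "summary": ("summary", "no summary"),
    "published": ("published", "no published"),
}


def entry_to_row(entry, columns=COLUMNS):
    """Build one row dict, substituting the default for any missing key."""
    return {name: entry.get(key, default) for name, (key, default) in columns.items()}


rows = [entry_to_row(e) for e in sample_entries]
print(rows[1])
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), and a new column then only needs one more line in COLUMNS.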