Scraping news articles into one single list with NewsPaper library in Python?

Dear Stack Overflow community!

I want to scrape news articles from the CNN RSS feed and get the link for each scraped article. This works well with the Python NewsPaper library, but unfortunately I can't get the output into a usable format, i.e. a list or a dictionary.

I would like to add the scraped links to one single list, instead of many separate lists.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {"cnn": {"link": "http://edition.cnn.com/", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}}

    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss']) 
            #if there is an RSS value for a company, it will be extracted into d

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    print(article['link'])

The output looks like this:

http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html
http://rss.cnn.com/~r/rss/cnn_topstories/~3/_O8rud1qEXA/joe-walsh-trump-gop-voters-sot-crn-vpx.cnn
http://rss.cnn.com/~r/rss/cnn_topstories/~3/xj-0PnZ_LwU/index.html
.......

I would like one list containing all the links, i.e.:

    list =[http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html , http://rss.cnn.com/~r/rss/cnn_topstories/~3/_O8rud1qEXA/joe-walsh-trump-gop-voters-sot-crn-vpx.cnn , http://rss.cnn.com/~r/rss/cnn_topstories/~3/xj-0PnZ_LwU/index.html ,... ]

I tried appending the content via a for loop, like this:

    for i in article['link']:
        article_list = []
        article_list.append(i)
        print(article_list)

But the output looks like this:

['h']
['t']
['t']
['p']
[':']
['/']
['/']
['r']
['s']
...
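The loop above produces single characters because `article['link']` is a string, and iterating over a string yields its characters one at a time; on top of that, `article_list = []` inside the loop resets the list on every pass. A minimal illustration (the URL is a stand-in for `article['link']`):

```python
# Iterating over a string yields its characters, not whole links
link = "http://example.com/index.html"  # stand-in for article['link']
chars = []
for i in link:
    chars.append(i)
print(chars[:4])  # ['h', 't', 't', 'p']

# To collect whole links, create the list once, outside the loop,
# and append the full string instead of iterating over it
links = []
links.append(link)
print(links)  # ['http://example.com/index.html']
```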

Does anyone know another way to get the content into one single list? Or into a dictionary like the following:

    dict = {'links':[link1 , link2 , link 3]}

Thank you very much for your help!

Try modifying your code like this: create `article_list` once, before the loop, and append each full link to it:

    article_list = []
    for entry in d.entries:
        if hasattr(entry, 'published'):
            article = {}
            article['link'] = entry.link
            article_list.append(article['link'])
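If you also want the dictionary form asked for in the question, you can wrap the collected list after the loop. A sketch with stand-in entries (in the real code, `entries` would be `d.entries` from `feedparser`, whose entries likewise expose `.link` and, when present, `.published`):

```python
# Stand-in for feedparser entries; only some entries carry 'published'
class Entry:
    def __init__(self, link, published=None):
        self.link = link
        if published is not None:
            self.published = published

entries = [
    Entry("http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html",
          published="Mon, 01 Jan 2024"),  # hypothetical date value
    Entry("http://example.com/no-date.html"),  # skipped: no 'published'
]

# Collect every link with a 'published' attribute into one list
article_list = []
for entry in entries:
    if hasattr(entry, 'published'):
        article_list.append(entry.link)

# Wrap the list in the requested dictionary form
links_dict = {'links': article_list}
print(links_dict)
```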