Scraping multiple news article sources into one single list with NewsPaper library in Python?

Dear Stack Overflow community!

This is a follow-up to a question I posted earlier.

I want to use the NewsPaper library to extract news article URLs from multiple sources into one list. This works fine with a single source, but as soon as I add a second source link, only the URLs from the second source are extracted.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {
        "cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
        "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"},
    }

    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss']) 
            #if there is an RSS value for a company, it will be extracted into d
            article_list = []

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])

The output looks like this; only the links from the second source are appended:

    ['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html', ...]

I would like the URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Thank you very much!!

article_list is being overwritten inside your first for loop. Every time you iterate over a new source, article_list is set to a fresh empty list, effectively discarding all the links collected from the previous sources. That is why you end up with links from only one source, the last one.

You should initialize article_list once at the beginning instead of overwriting it:

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {
        "cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
        "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"},
    }

    article_list = []  # initialize ONCE, before the loop
    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss'])
            # if there is an RSS value for a company, it will be parsed into d
            # article_list = []  <-- this is where it was being overwritten

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    article_list.append(article['link'])
                    print(article['link'])
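The pitfall can be reproduced without any network access. This is a minimal sketch of the same loop structure; the source names and fake entry lists below are made-up stand-ins for the `fp.parse(...).entries` results, not real feed data:

    # Fake per-source feed entries standing in for fp.parse(...).entries
    feeds = {
        "cnn": ["cnn-article-1", "cnn-article-2"],
        "cnbc": ["cnbc-article-1"],
    }

    # Buggy pattern: the list is re-created for every source,
    # so only the last source's links survive the loop.
    for source, entries in feeds.items():
        buggy_list = []
        for link in entries:
            buggy_list.append(link)
    print(buggy_list)    # ['cnbc-article-1']

    # Fixed pattern: create the list once, before the loop,
    # so every source appends to the same list.
    article_list = []
    for source, entries in feeds.items():
        for link in entries:
            article_list.append(link)
    print(article_list)  # ['cnn-article-1', 'cnn-article-2', 'cnbc-article-1']

Since Python 3.7 dicts preserve insertion order, so the combined list keeps the links in the order the sources were declared.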