Web Scraping with Python and newspaper3k lib does not return data

I have installed the newspaper3k lib on my Mac with sudo pip3 install newspaper3k, and I am using Python 3. I want to get the data that the Article object supports, i.e. url, publish date, title, text, summary and keywords, but I am not getting any data:

import newspaper
from newspaper import Article

#creating website for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)

#I have tried https://www.euronews.com/, https://edition.cnn.com/ and https://www.bbc.com/

for article in cnn_paper.articles:

    article_url = article.url #works

    news_article = Article(article_url)#works

    print("OBJECT:", news_article, '\n')#works
    print("URL:", article_url, '\n')#works
    print("DATE:", news_article.publish_date, '\n')#does not work
    print("TITLE:", news_article.title, '\n')#does not work
    print("TEXT:", news_article.text, '\n')#does not work
    print("SUMMARY:", news_article.summary, '\n')#does not work
    print("KEYWORDS:", news_article.keywords, '\n')#does not work
    print()
    input()

I get the article object and the URL, but everything else is ''. I have tried different websites and the result is the same.

Then I tried adding:

    news_article.download()
    news_article.parse()
    news_article.nlp()

I have also tried setting up a Config with custom HEADERS and TIMEOUTs, but the results are the same.

When I do that, for every website I only get 16 articles with date, title and body values. That seems strange to me: for every website I get the same amount of data, and for more than 95% of the news articles I get None.
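Putting it all together, my attempt looked roughly like this (a sketch, not my exact script; the user agent and timeout are just the values I tried):

import newspaper
from newspaper import Article, Config

config = Config()
config.browser_user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config.request_timeout = 10

paper = newspaper.build('https://www.euronews.com/', config=config, memoize_articles=False)

for article in paper.articles:
    news_article = Article(article.url, config=config)
    news_article.download()  # fetch the HTML first
    news_article.parse()     # then extract date, title and text
    news_article.nlp()       # then generate summary and keywords
    print("DATE:", news_article.publish_date)
    print("TITLE:", news_article.title)
    print("TEXT:", news_article.text)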

Could Beautiful Soup help me here?

Can someone help me understand what the problem is, why I am getting so many Null/NaN/'' values, and how I can fix it?

Here is the documentation for the lib:

https://newspaper.readthedocs.io/en/latest/

I would recommend that you look over the newspaper overview document that I published on GitHub. That document has multiple extraction examples and other techniques that may be useful.

Concerning your question...

Newspaper3k will parse certain websites almost perfectly. But there are plenty of websites where you need to review a page's navigational structure to determine how to parse the article elements correctly.
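Before writing any site-specific parsing code, it can help to download a single article and print its meta_data dictionary to see what the site actually exposes. A minimal sketch (the URL below is a placeholder; substitute any story page from the target site):

from newspaper import Article

# placeholder URL -- substitute any single story page from the target site
url = 'https://www.marketwatch.com/story/some-article'

article = Article(url)
article.download()
article.parse()

# meta_data holds everything newspaper found in the page's <meta> tags;
# its keys show which fields (dates, authors, keywords) the site provides
for key, value in article.meta_data.items():
    print(key, '->', value)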

For example, https://www.marketwatch.com has individual article elements, such as the title, publish date and other items, stored in the meta tag section of the page.

The newspaper example below will parse these elements correctly. I did note that you might need to do some data cleaning of the keyword or tag output.

import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.marketwatch.com'
article_urls = set()
marketwatch = newspaper.build(base_url, config=config, memoize_articles=False, language='en')
for sub_article in marketwatch.articles:
    article = Article(sub_article.url, config=config, memoize_articles=False, language='en')
    article.download()
    article.parse()
    if article.url not in article_urls:
        article_urls.add(article.url)

        # The majority of the article elements are located
        # within the meta data section of the page's
        # navigational structure
        article_meta_data = article.meta_data

        published_date = {value for (key, value) in article_meta_data.items() if key == 'parsely-pub-date'}
        article_published_date = " ".join(str(x) for x in published_date)

        authors = sorted({value for (key, value) in article_meta_data.items() if key == 'parsely-author'})
        article_author = ', '.join(authors)

        title = {value for (key, value) in article_meta_data.items() if key == 'parsely-title'}
        article_title = " ".join(str(x) for x in title)

        keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'keywords'})
        keywords_list = sorted(keywords.lower().split(','))
        article_keywords = ', '.join(keywords_list)

        tags = ''.join({value for (key, value) in article_meta_data.items() if key == 'parsely-tags'})
        tag_list = sorted(tags.lower().split(','))
        article_tags = ', '.join(tag_list)

        summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
        article_summary = " ".join(str(x) for x in summary)

        # the replace is used to remove newlines
        article_text = article.text.replace('\n', '')
        print(article_text)
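One design note on the example above: the duplicate check against article_urls happens after download() and parse(), so duplicate links returned by newspaper.build are still fetched before being skipped. The euronews example below checks the set first, which avoids downloading the same page twice.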

https://www.euronews.com is similar to https://www.marketwatch.com, except that some article elements are located in the main body, while other items are in the meta tag section of the page.

import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.euronews.com'
article_urls = set()
euronews = newspaper.build(base_url, config=config, memoize_articles=False, language='en')
for sub_article in euronews.articles:
    if sub_article.url not in article_urls:
        article_urls.add(sub_article.url)
        article = Article(sub_article.url, config=config, memoize_articles=False, language='en')
        article.download()
        article.parse()

        # The majority of the article elements are located
        # within the meta data section of the page's
        # navigational structure
        article_meta_data = article.meta_data

        published_date = {value for (key, value) in article_meta_data.items() if key == 'date.created'}
        article_published_date = " ".join(str(x) for x in published_date)

        article_title = article.title

        summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
        article_summary = " ".join(str(x) for x in summary)

        keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'keywords'})
        keywords_list = sorted(keywords.lower().split(','))
        article_keywords = ', '.join(keywords_list).strip()

        # the replace is used to remove newlines
        article_text = article.text.replace('\n', '')
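If a site exposes neither a description nor a keywords meta tag, newspaper's own nlp() method can generate a summary and keywords from the parsed text instead. A minimal sketch (the URL is a placeholder; nlp() must be called after download() and parse()):

from newspaper import Article

# placeholder URL -- substitute a real story page
url = 'https://www.euronews.com/2020/10/14/example-story'

article = Article(url)
article.download()
article.parse()
article.nlp()  # populates article.summary and article.keywords from the text

print('SUMMARY:', article.summary)
print('KEYWORDS:', article.keywords)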