Newspaper3k, User Agents and Scraping
I am making text files that contain the author, publication date, and body text of news articles. I have code that does this, but I need Newspaper3k to first identify that information in each article. Because user-agent specification had been an issue, I also specify a user agent. Here is my code so you can follow along. This is Python version 3.9.0.
import time, os, random, nltk, newspaper
from newspaper import Article, Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
# article.html  # uncomment to inspect the raw HTML that download() retrieved
article.parse()
article.nlp()
article.authors
article.publish_date
article.text
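Before parsing, a quick sanity check is whether the download step returned any HTML at all, since a blocked or empty response may not be obvious until parse() complains. A minimal sketch, reusing the url and config defined above (html is the attribute that download() fills in):

check = Article(url, config=config)
check.download()
print(len(check.html or ""))  # 0 (or very small) suggests the request was blocked or returned nothing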
To see why this case is particularly puzzling, replace the link I provided above with a different one and re-run the code. With that link the code runs correctly, returning the author, date, and text; with the link in the code above, it does not. What am I overlooking here?
Apparently, newspaper requires us to specify the language we are interested in. For some odd reason the code below still does not extract the author, but it is good enough for me. Here is the code, in case anyone else can benefit from it.
#
# Imports our modules
#
import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator
translator = Translator()
# The link we're interested in
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
#
# Extracts the meta-data
#
article = Article(url, language='es')
article.download()
article.parse()
article.nlp()
#
# Makes these into strings so they'll get into the list
#
authors = str(article.authors)
date = str(article.publish_date)
maintext = translator.translate(article.summary).text
# Build the list of elements we'll append
elements = [authors + "\n", date + "\n", maintext + "\n", url]
for x in elements:
    print(x)
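For completeness, the custom user agent from the question and the language setting can be combined through Config. This is just a sketch: I am assuming config.language behaves the same as passing language='es' directly to Article, and reusing the url defined above.

from newspaper import Article, Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

config = Config()
config.browser_user_agent = user_agent
config.language = 'es'  # assumed equivalent to Article(url, language='es')

article = Article(url, config=config)
article.download()
article.parse()
article.nlp()
print(article.publish_date)
print(article.summary)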