Extract image using Newspaper from HTML
I can't download the article the usual way to instantiate an Article object, as shown below:
from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.top_image
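However, I can fetch the HTML with requests. Can I take that raw HTML and somehow pass it to Newspaper to extract the image from it? (Below is one attempt, but it does not work.) Thanks.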
from newspaper import Article
import requests
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html= requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image
First, make sure you are using python3 and that you have already run pip3 install newspaper3k.
Then, if the first version gives you SSL warnings (like the one below):
/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
you can disable them by adding:
import urllib3
urllib3.disable_warnings()
This should work:
from newspaper import Article
import urllib3
urllib3.disable_warnings()
url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)
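Run it with python3 <yourfile>.py.
Setting the HTML on the article yourself won't do you much good, because then you won't get anything in the other fields. Let me know if this solves the problem, or if any other errors pop up!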
The Python module Newspaper allows the use of proxies, but this feature is not listed in the module's documentation.
Proxy with Newspaper
from newspaper import Article
from newspaper.configuration import Configuration
# add your corporate proxy information and test the connection
PROXIES = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number"
}
config = Configuration()
config.proxies = PROXIES
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url, config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
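If the proxy is already configured through environment variables, the mapping does not have to be hard-coded. The following is only a sketch under that assumption: `proxies_from_environment` is a hypothetical helper name, the addresses are made-up placeholders, and it relies on the standard library's `urllib.request.getproxies()` to read the `http_proxy` / `https_proxy` variables.

```python
import os
import urllib.request


def proxies_from_environment():
    """Return the http/https proxy entries that urllib detects (hypothetical helper)."""
    found = urllib.request.getproxies()
    return {scheme: found[scheme] for scheme in ("http", "https") if scheme in found}


# Placeholder values, set here only so the example is self-contained.
os.environ["http_proxy"] = "http://10.0.0.1:8080"
os.environ["https_proxy"] = "http://10.0.0.1:8080"

PROXIES = proxies_from_environment()
print(PROXIES)
```

The resulting dictionary has the same shape as the PROXIES mapping above, so it could be assigned to config.proxies in the same way.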
Proxy with Requests and Newspaper
import requests
from newspaper import Article
# placeholder proxy settings, in the same format as PROXIES above
proxy = {
    'http': "http://ip_address:port_number",
    'https': "https://ip_address:port_number"
}
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
# hand the already-fetched HTML to Newspaper instead of letting it download the page
article.download(raw_html.content)
article.parse()
print(article.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
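If Newspaper cannot be used at all and only the raw HTML is available, a rough fallback is to read the page's Open Graph image tag directly. The sketch below is not Newspaper's implementation and only assumes the page carries an og:image meta tag; `OgImageParser`, `find_og_image`, and the sample HTML are hypothetical names and data made up for illustration. It uses only the standard library.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Remembers the first og:image URL seen in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only until the first match is found.
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def find_og_image(html):
    """Return the og:image URL from an HTML string, or None if absent."""
    parser = OgImageParser()
    parser.feed(html)
    return parser.image_url


# Made-up sample page; in practice this would be the text fetched with requests.
sample_html = """
<html><head>
<meta property="og:title" content="Example">
<meta property="og:image" content="https://example.com/lead.jpg">
</head><body></body></html>
"""
print(find_og_image(sample_html))
```

With a real response, the call would be find_og_image(raw_html.text). Pages without an og:image tag return None, so this covers fewer cases than article.top_image.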