What causes "urlopen error [Errno 13] Permission denied" errors?

I am trying to write a Python (version 2.7.5) CGI script on a CentOS 7 server. My script tries to download data from librivox web pages such as https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/ and it crashes with the following error:

<class 'urllib2.URLError'>: <urlopen error [Errno 13] Permission denied> 
      args = (error(13, 'Permission denied'),) 
      errno = None 
      filename = None 
      message = '' 
      reason = error(13, 'Permission denied') 
      strerror = None

I have turned iptables off, and I can run things like `wget -O- https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/` without any errors. Here is the bit of code where the error occurs:

import urllib2
from bs4 import BeautifulSoup

def output_html ( url, appname, doobb ):
        print "url is %s<br>" % url
        soup = BeautifulSoup(urllib2.urlopen( url ).read())

Update: Thanks Paul and alecxe, I have updated my code as follows:

import requests
from bs4 import BeautifulSoup

def output_html ( url, appname, doobb ):
        #hdr = {'User-Agent':'Mozilla/5.0'}
        #print "url is %s<br>" % url
        #req = urllib2.Request(url, headers=hdr)
        # soup = BeautifulSoup(urllib2.urlopen( url ).read())
        headers = {'User-Agent':'Mozilla/5.0'}
        # headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}
        response = requests.get( url, headers=headers)

        soup = BeautifulSoup(response.content)

...and now I get a slightly different error when...

response = requests.get( url, headers=headers)

...gets called...

<class 'requests.exceptions.ConnectionError'>: ('Connection aborted.', error(13, 'Permission denied')) 
      args = (ProtocolError('Connection aborted.', error(13, 'Permission denied')),) 
      errno = None 
      filename = None 
      message = ProtocolError('Connection aborted.', error(13, 'Permission denied')) 
      request = <PreparedRequest [GET]> 
      response = None 
      strerror = None

...and interestingly, I wrote a command-line version of this script that works fine, and it looks like this...

def output_html ( url ):
        soup = BeautifulSoup(urllib2.urlopen( url ).read())

Don't you think that's strange?

Update: This question may already have an answer here: urllib2.HTTPError: HTTP Error 403: Forbidden (2 answers)

No, they do not answer the question.

Using requests and providing a User-Agent header works for me:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.99 Safari/537.36'}
response = requests.get("https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/", headers=headers)

soup = BeautifulSoup(response.content)
print soup.title.text  # prints "LibriVox"
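
If you would rather stay with urllib2 instead of switching to requests, roughly the same thing can be done by passing the header through a urllib2.Request object. A minimal, untested sketch along those lines:

import urllib2
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request("https://librivox.org/selections-from-battle-pieces-and-aspects-of-the-war-by-herman-melville/", headers=headers)
soup = BeautifulSoup(urllib2.urlopen(req).read())
print soup.title.text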

Finally figured it out... SELinux was blocking the outbound connection from the CGI script. Generating a local policy module from the denials in the audit log and loading it fixed the problem:

# grep python /var/log/audit/audit.log | audit2allow -M mypol
# semodule -i mypol.pp
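
For anyone who wants to confirm that SELinux is what is blocking the script before building a policy module, the denials should show up in the audit log while the CGI script runs. Assuming the standard audit tools are installed, the following (run as root) shows whether SELinux is enforcing and lists recent AVC denials involving python:

# getenforce
# ausearch -m avc -ts recent | grep python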

We ran into the same problem on one of our machines. Instead of creating an SELinux module (as in the answer above), we made the following change to an SELinux boolean, which prevents this kind of error:

# setsebool httpd_can_network_connect on

As described on the CentOS wiki:

httpd_can_network_connect (HTTPD Service): Allow HTTPD scripts and modules to connect to the network.
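
One note on the boolean approach: setsebool on its own only changes the running policy, so the setting is lost after a reboot. Adding the -P flag makes it persistent, and getsebool can be used to check the current value:

# setsebool -P httpd_can_network_connect on
# getsebool httpd_can_network_connect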