requests and urllib2 return an error from an XBRL page: 'The browser mode you are running is not compatible with this application'

I can't figure out why I can't fetch the page from this link. All I want to do is get it and feed it into BeautifulSoup.

import requests,urllib2

link='https://www.sec.gov/ix?doc=/Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm'

r = requests.get(link)

r2=urllib2.urlopen(link)
html=r2.read()

I also tried spoofing a browser:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

r = requests.get(link, headers=headers)

Same text... not the page I want.

The response I get back contains a script that looks like this:

var note = 'The browser mode you are running is not compatible with this application.';
browserName ='Microsoft Internet Explorer';
note +='You are currently running '+browserName+' '+((ie7>0)?7:8)+'.0.';
var userAgent = window.navigator.userAgent.toLowerCase();
if(userAgent.indexOf('ipad') != -1 || userAgent.indexOf('iphone') != -1 || userAgent.indexOf('apple') != -1){
    note += ' Please use a more current version of '+browserName+' in order to use the application.';
}else if(userAgent.indexOf('android') != -1){
    note += ' Please use a more current version of Google Chrome or Mozilla Firefox in order to use the application.';
}else{
    note += ' Please use a more current version of Microsoft Internet Explorer, Google Chrome or Mozilla Firefox in order to use the application.';
}

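I can access this page just fine: https://www.sec.gov/Archives/edgar/data/1373715/000137371518000153/erq2fy18-document.htm

That one is not an XBRL document. I think this has something to do with XBRL, and with the server expecting my browser to interact with the data?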

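This part of the page appears to be rendered with JavaScript. For dynamic content the most reliable option is usually selenium, but in this case you can avoid it and use requests.

The viewer page evidently loads its content from the document named in the doc query parameter: /Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm. You can bypass the viewer page and request that document directly.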

import requests

url = "https://www.sec.gov/Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm"
r = requests.get(url)
html = r.text

print(html)
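
If you have many viewer links of the same form, the rewrite from the ix?doc= URL to the direct document URL can be done in code rather than by hand. The following is only a sketch, written for Python 3: the helper name direct_document_url is hypothetical, and it assumes the viewer always carries the document path in a doc query parameter, as the link in the question does. It makes no network request.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def direct_document_url(viewer_url):
    """Return the underlying document URL for an 'ix?doc=' viewer link.

    Hypothetical helper: assumes the document path is stored in the
    'doc' query parameter. Links without that parameter are returned
    unchanged.
    """
    parts = urlsplit(viewer_url)
    doc_paths = parse_qs(parts.query).get("doc")
    if not doc_paths:
        return viewer_url
    # Re-attach the document path to the original scheme and host.
    return urljoin(f"{parts.scheme}://{parts.netloc}", doc_paths[0])


link = "https://www.sec.gov/ix?doc=/Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm"
print(direct_document_url(link))
```

The resulting URL can then be passed to requests.get as in the snippet above, and r.text handed to BeautifulSoup.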