Extract data from an https site into Python using urllib ("Your request cannot be completed" error)

Question

I have been trying to pull the contents of an https website into Python using urllib. I used 4 lines of code.

import urllib
fhand = urllib.urlopen('https://www.tax.service.gov.uk/view-my-valuation/list-valuations-by-postcode?postcode=w1a&startPage=1#search-results')

for line in fhand:
    print line.strip()

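The connection seems to work when the page is opened from Python. However, the output I get back is a different page, with error messages in its title, heading, and paragraph, as shown below. I expected the output to be a series of html tags containing the data available on the website, such as addresses, base rates, and case numbers (i.e. the html I see in the Elements panel of the Google Chrome developer tools). Can anyone guide me on how to get this data into Python?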

Thanks and regards

<!DOCTYPE html>
<html class="no-branding"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Your request cannot be completed - GOV.UK</title>
<link href="/edge-assets/gone.css" media="screen" rel="stylesheet" type="text/css">
<!--[if lte IE 8]><link href="/edge-assets/ie.css" media="screen" rel="stylesheet" type="text/css"><![endif]-->
<link rel="icon" href="/edge-assets/govukfavicon.ico" type="image/x-icon" />
</head>
<body>
<div id="wrapper">
<div id="banner" role="banner">
<div class="inner">
<h1>
<a href="https://www.gov.uk/">
<img src="/edge-assets/govuk-logo.png" alt="GOV.UK">
</a>
</h1>
</div>
</div>
<div id="message" role="main">
<div class="inner">
<div id="detail">
<h2>Sorry, there was a problem handling your request.</h2>
<p class="call-to-action">Please try again shortly.</p>
</div>
<div id="footer">
</div>
</div>
</div>
</div>
</body></html>

Answer 1

Some websites block requests when the user-agent is missing or does not match what they expect. So try adding a user-agent to the request headers:

import urllib2


headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.tax.service.gov.uk/view-my-valuation/list-valuations-by-postcode?postcode=w1a&startPage=1#search-results'
req = urllib2.Request(url, headers=headers)
f = urllib2.urlopen(req)
s = f.read()
print s
f.close()
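
The snippet above is Python 2. In Python 3, urllib2 was merged into urllib.request, so the same idea can be sketched as below. This is only a sketch: the helper names build_request and fetch are made up for illustration, the UTF-8 decoding is an assumption based on the charset declared in the page shown in the question, and whether this particular site accepts the 'Mozilla/5.0' user-agent has not been verified here. The '#search-results' fragment is left out because a URL fragment is handled by the browser and is not sent to the server.

```python
# Python 3 sketch of the same approach: send a browser-like User-Agent.
from urllib.request import Request, urlopen

URL = ('https://www.tax.service.gov.uk/view-my-valuation/'
       'list-valuations-by-postcode?postcode=w1a&startPage=1')
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def build_request(url=URL, headers=HEADERS):
    """Return a Request object that carries the given headers."""
    return Request(url, headers=headers)


def fetch(url=URL, headers=HEADERS, timeout=10):
    """Fetch the page and return its body as text (assumes UTF-8)."""
    with urlopen(build_request(url, headers), timeout=timeout) as response:
        return response.read().decode('utf-8', errors='replace')


# Usage (performs a real network request):
# print(fetch())
```

Splitting the request construction from the network call makes it possible to inspect the headers that would be sent without contacting the site.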

Alternatively, you can pip install requests and use print(requests.get(url).text)
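
Whichever client is used, it can help to check whether the body that comes back is the real page or the GOV.UK error page from the question. A minimal Python 3 sketch using the standard library's html.parser follows; the names TitleParser and looks_blocked are hypothetical, and the check assumes the error page keeps the title text "Your request cannot be completed" seen in the question's output.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self._in_title:
            self.title += data


def looks_blocked(html_text):
    """Return True if the page title matches the error page from the question."""
    parser = TitleParser()
    parser.feed(html_text)
    return 'Your request cannot be completed' in parser.title
```

If looks_blocked returns True for the fetched text, the request was still rejected and the headers probably need further adjustment.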
