urllib2 doesn't get the same HTML string as normal browsers with the same user agent (encoding error?)
I'm trying to fetch a page from this site, http://www.francais-thai.com/dicoweb/fran/00012.htm, but in Python (the page contains Thai text).
Here is the code I tried (it should download the page):
# -*- coding: utf-8 -*-
import urllib2
agents = {'User-Agent':"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36"}
url = ("http://www.francais-thai.com/dicoweb/fran/00012.htm")
request = urllib2.Request(url, headers=agents)
page = urllib2.urlopen(request).read()
file = open("00012.htm","w")
file.write(page)
file.close()
But the page I get this way is completely different from the one Firefox, Chrome, etc. give me when I view the source.
This is the page I get through Chrome:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>abeille</title>
<style type="text/css">
<!--
body {font-size: medium;}
.FAG {font-family: Arial, Helvetica, sans-serif; font-weight: bold; color: #000;}
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
.PN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #00F;}
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
.TN {font-family: "Angsana New"; font-weight: normal; color: #F00;}
-->
</style>
</head>
<body lang="fr">
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
</body>
</html>
And this is the wrong page I get with my code:
<html xmlns="http://www.w3.org/1999/xhtml">
㰀栀攀愀搀㸀ഀഀ
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
㰀琀椀琀氀攀㸀愀戀攀椀氀氀攀㰀⼀琀椀琀氀攀㸀ഀഀ
<style type="text/css">
㰀℀ⴀⴀഀഀ
body {font-size: medium;}
⸀䘀䄀䜀 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 䄀爀椀愀氀Ⰰ 䠀攀氀瘀攀琀椀挀愀Ⰰ 猀愀渀猀ⴀ猀攀爀椀昀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 戀漀氀搀㬀 挀漀氀漀爀㨀 ⌀ 㬀紀ഀഀ
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
⸀倀一 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 ∀吀椀洀攀猀 一攀眀 刀漀洀愀渀∀Ⰰ 吀椀洀攀猀Ⰰ 猀攀爀椀昀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 渀漀爀洀愀氀㬀 挀漀氀漀爀㨀 ⌀ 䘀㬀紀ഀഀ
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
⸀吀一 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 ∀䄀渀最猀愀渀愀 一攀眀∀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 渀漀爀洀愀氀㬀 挀漀氀漀爀㨀 ⌀䘀 㬀紀ഀഀ
-->
㰀⼀猀琀礀氀攀㸀ഀഀ
</head>
㰀戀漀搀礀 氀愀渀最㴀∀昀爀∀㸀ഀഀ
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
㰀⼀戀漀搀礀㸀ഀഀ
</html>
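I tried changing the user agent, and eventually used Wireshark to capture the exact user agent my browser sends, but the same "bugged" page is downloaded instead of the correct one.
How can I get the same HTML text in Python as a normal browser gets?
I guess it's an encoding error (the HTML contains Thai), but I couldn't get it to work, even after trying to change the encoding and so on.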
The page is actually encoded as UTF-16, not UTF-8 (despite what its meta tag declares), so the user agent is irrelevant:
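import urllib2
url = "http://www.francais-thai.com/dicoweb/fran/00012.htm"
request = urllib2.Request(url)
response = urllib2.urlopen(request)
# Decode the raw bytes as UTF-16 instead of trusting the utf-8 meta tag
print(response.read().decode("utf-16"))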
Output:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>abeille</title>
<style type="text/css">
<!--
body {font-size: medium;}
.FAG {font-family: Arial, Helvetica, sans-serif; font-weight: bold; color: #000;}
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
.PN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #00F;}
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
.TN {font-family: "Angsana New"; font-weight: normal; color: #F00;}
-->
</style>
</head>
<body lang="fr">
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
</body>
</html>
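If the goal is a file on disk that matches what the browser shows, one option is to decode the bytes and re-save them as UTF-8, which also makes the file agree with its own charset=utf-8 meta tag. This is only a minimal sketch: save_as_utf8 is a hypothetical helper, and the sample bytes stand in for the real response body so the snippet runs offline.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_as_utf8(raw_bytes, path, source_encoding="utf-16"):
    """Decode raw_bytes from source_encoding and write the text to path as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    # io.open with an explicit encoding behaves the same on Python 2 and 3;
    # newline="" writes line endings exactly as they appear in the text.
    with io.open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return text


# Stand-in for urlopen(request).read(): a BOM-prefixed UTF-16 payload (assumption).
sample = u'<span class="TN">\u0e1c\u0e36\u0e49\u0e07</span>\r\n'.encode("utf-16")
target = os.path.join(tempfile.mkdtemp(), "00012.htm")
decoded = save_as_utf8(sample, target)
```

With the real page, the bytes returned by response.read() would be passed in place of sample.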
requests has the same problem. Using chardet, you can see that the encoding is detected as UTF-16LE:
import chardet
print chardet.detect(response.read())
{'confidence': 1.0, 'encoding': 'UTF-16LE'}
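UTF-16LE may also explain why only every other line of the saved file is garbled. This is an assumption, not something confirmed above: if the script ran under Python 2 on Windows, open("00012.htm","w") is a text-mode handle that turns each 0x0A byte into 0x0D 0x0A. That one extra byte shifts the 2-byte alignment until the next newline shifts it back. The sketch below simulates this on a made-up four-line page.

```python
# -*- coding: utf-8 -*-
# Made-up sample standing in for the real page; CRLF line endings are assumed.
page = u"<html>\r\n<head>\r\n<title>abeille</title>\r\n</head>\r\n"
raw = page.encode("utf-16-le")

# Simulate a Windows text-mode write (assumption): each 0x0A byte becomes 0x0D 0x0A.
mangled = raw.replace(b"\n", b"\r\n")

# Read the damaged bytes back as UTF-16LE, as a browser or editor might.
garbled = mangled.decode("utf-16-le", "replace")

# Every second line comes out as CJK-looking characters, the others stay readable.
for line in garbled.split(u"\n"):
    print(repr(line))
```

If that is what happened, opening the output file with "wb" instead of "w" would write the downloaded bytes untouched.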