urllib2 doesn't get the same HTML string as normal browsers with the same user agent (encoding error?)

I'm trying to fetch a page from this site, http://www.francais-thai.com/dicoweb/fran/00012.htm, in Python. (The page contains Thai text.)

This is the code I tried; it should download the page:

# -*- coding: utf-8 -*-
import urllib2

agents = {'User-Agent':"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.125 Safari/537.36"}

url = ("http://www.francais-thai.com/dicoweb/fran/00012.htm")
request = urllib2.Request(url, headers=agents)
page = urllib2.urlopen(request).read()


file = open("00012.htm","w")
file.write(page)
file.close()

But the page I get this way is completely different from the one that Firefox, Chrome, etc. show me when I view the source.

This is the page I get with Chrome:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>abeille</title>
<style type="text/css">
<!--
body {font-size: medium;}
.FAG {font-family: Arial, Helvetica, sans-serif; font-weight: bold; color: #000;}
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
.PN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #00F;}
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
.TN {font-family: "Angsana New"; font-weight: normal; color: #F00;}
-->
</style>
</head>
<body lang="fr">
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
</body>
</html>

This is the broken page I get with my code:

<html xmlns="http://www.w3.org/1999/xhtml">
਍㰀栀攀愀搀㸀ഀഀ
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
਍㰀琀椀琀氀攀㸀愀戀攀椀氀氀攀㰀⼀琀椀琀氀攀㸀ഀഀ
<style type="text/css">
਍㰀℀ⴀⴀഀഀ
body {font-size: medium;}
਍⸀䘀䄀䜀 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 䄀爀椀愀氀Ⰰ 䠀攀氀瘀攀琀椀挀愀Ⰰ 猀愀渀猀ⴀ猀攀爀椀昀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 戀漀氀搀㬀 挀漀氀漀爀㨀 ⌀   㬀紀ഀഀ
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
਍⸀倀一 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 ∀吀椀洀攀猀 一攀眀 刀漀洀愀渀∀Ⰰ 吀椀洀攀猀Ⰰ 猀攀爀椀昀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 渀漀爀洀愀氀㬀 挀漀氀漀爀㨀 ⌀  䘀㬀紀ഀഀ
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
਍⸀吀一 笀昀漀渀琀ⴀ昀愀洀椀氀礀㨀 ∀䄀渀最猀愀渀愀 一攀眀∀㬀 昀漀渀琀ⴀ眀攀椀最栀琀㨀 渀漀爀洀愀氀㬀 挀漀氀漀爀㨀 ⌀䘀  㬀紀ഀഀ
-->
਍㰀⼀猀琀礀氀攀㸀ഀഀ
</head>
਍㰀戀漀搀礀 氀愀渀最㴀∀昀爀∀㸀ഀഀ
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
਍㰀⼀戀漀搀礀㸀ഀഀ
</html>
਍

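I tried changing the user agent, and eventually used Wireshark to copy the browser's exact user agent, but I still download the same "bugged" page instead of the correct one.

How can I get the same HTML text in Python as a normal browser does?

I suspect an encoding error (the HTML contains Thai), but nothing I tried, such as changing the encoding, made it work.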

The page is actually encoded as UTF-16, not UTF-8 (despite the charset=utf-8 declared in its meta tag), so the user agent is irrelevant:

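import urllib2

url = "http://www.francais-thai.com/dicoweb/fran/00012.htm"
request = urllib2.Request(url)
response = urllib2.urlopen(request)

# The body is UTF-16, so decode it as UTF-16 rather than UTF-8
print(response.read().decode("utf-16"))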

Output:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>abeille</title>
<style type="text/css">
<!--
body {font-size: medium;}
.FAG {font-family: Arial, Helvetica, sans-serif; font-weight: bold; color: #000;}
.FTN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #000;}
.PN {font-family: "Times New Roman", Times, serif; font-weight: normal; color: #00F;}
.PG {font-family: "Times New Roman", Times, serif; font-weight: bold; color: #00F;}
.TN {font-family: "Angsana New"; font-weight: normal; color: #F00;}
-->
</style>
</head>
<body lang="fr">
<span class="FAG">abeille</span><span class="FTN">........................................................................... </span><span class="PN">\phugn</span><span class="FAG"> - </span><span class="TN">ผึ้ง</span><br>
</body>
</html>
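
The alternating garbled lines in the question's output look consistent with UTF-16LE bytes being written through a text-mode file on Windows, where every LF byte is expanded to CR LF and the two-byte alignment shifts by one. The sketch below only simulates that translation on a small sample; it is written for Python 3, the helper name `simulate_windows_text_mode` is made up for illustration, and it assumes the question's script ran on Windows with `open(..., "w")`.

```python
def simulate_windows_text_mode(raw):
    """Mimic a Windows text-mode write: every LF byte becomes CR LF."""
    return raw.replace(b"\n", b"\r\n")


# A small stand-in for the real page, encoded the way the server sends it.
sample = "<html>\r\n<head>\r\n<title>x</title>"
raw = sample.encode("utf-16-le")

# Each inserted CR byte shifts the following bytes by one position, so
# every other line is read with its byte pairs misaligned.
mangled = simulate_windows_text_mode(raw)
decoded = mangled.decode("utf-16-le")

# ascii() keeps the output printable on any console.
print(ascii(decoded))

# The untouched bytes still decode cleanly.
assert raw.decode("utf-16-le") == sample
```

In this simulation the first and third lines survive while `<head>` turns into CJK-looking characters, the same pattern as in the question. Opening the output file in binary mode (`"wb"`) would avoid the newline translation, although the saved file would still be UTF-16 rather than UTF-8.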

requests has the same problem. Using chardet, you can confirm that the encoding is detected as UTF-16LE:

import chardet
# Call this on a fresh response: read() returns an empty string once the body has been consumed
print chardet.detect(response.read())
{'confidence': 1.0, 'encoding': 'UTF-16LE'}
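
If chardet is not available, a rough check for UTF-16 can be done by hand. The following is a minimal sketch in Python 3, not a replacement for chardet: the helpers `sniff_utf16` and `save_as_utf8` are hypothetical names, the thresholds are arbitrary, and UTF-32 is deliberately ignored. It looks for a byte order mark first, then falls back to counting NUL bytes, and finally re-saves the decoded text as real UTF-8.

```python
import codecs
import io


def sniff_utf16(raw):
    """Guess a UTF-16 codec name for raw bytes, or return None."""
    # A byte order mark is the most reliable signal; the "utf-16" codec
    # consumes it and picks the byte order itself.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    sample = raw[:200]
    if len(sample) < 4:
        return None
    half = len(sample) // 2
    even_nuls = sample[0::2].count(b"\x00")
    odd_nuls = sample[1::2].count(b"\x00")
    # Mostly-ASCII markup in UTF-16 has a NUL in every other byte.
    if odd_nuls > 0.4 * half and even_nuls < 0.1 * half:
        return "utf-16-le"
    if even_nuls > 0.4 * half and odd_nuls < 0.1 * half:
        return "utf-16-be"
    return None


def save_as_utf8(raw, path):
    """Decode raw (UTF-16 if it looks like it, else UTF-8) and write UTF-8."""
    text = raw.decode(sniff_utf16(raw) or "utf-8")
    # newline="" disables newline translation on write.
    with io.open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return text
```

With the page from the question, one would pass the bytes returned by `read()` to `save_as_utf8(raw, "00012.htm")`; the saved file would then match its own `charset=utf-8` declaration.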