Character encoding looks different from what I get when I download the file manually
I am trying to use the following Google Translate API endpoint to translate text in my application:
https://clients5.google.com/translate_a/t?client=dict-chrome-ex&sl=auto&tl=en&q=контрольная%20работа
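When I click the link, it downloads a text file that, when opened, contains all the information I need and appears to be correctly formatted (sentences[0].trans = "test" is in the same format as if I had typed the word "test" by hand).
However, when I request the file with WWW in C#, with requests.get in Python, or through Postman, I get the following string for "trans" instead: "ÐºÐ¾Ð½Ñ‚Ñ € оР»ÑŒÐ½Ð ° Ñ Ñ € Ð ° Ð ± отР°".
I have tried converting it to a number of different encodings, but none gives the correct value. I also don't understand why the English parts of the full response come through correctly, while the translation, which should be English, is wrong, and the Russian parts that echo the original text are wrong as well.
No matter which encoding I switch to in C# (UTF-7, UTF-8, UTF-16, UTF-16 BE), the text I get never converts back to "test".
Am I missing something here?
The code that makes the request, the result of downloading the file manually, and the result of running the code are shown below:
Code: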
import json
import requests
text = "контрольная работа"
lang = "en"
url = f"https://clients5.google.com/translate_a/t?client=dict-chrome-ex&sl=auto&tl={lang}&q={text}"
url = url.replace(" ", "%20")
res = requests.get(url)
res = res.text
jres = json.loads(res)
translation = jres["sentences"][0]["trans"]
print(res, end="\n\n")
print("\t", translation)
Manual download (clicking the link in Chrome downloads the file):
{
"sentences": [
{
"trans": "test",
"orig": "контрольная работа",
"backend": 10
},
{
"src_translit": "kontrol'naya rabota"
}
],
"dict": [
{
"pos": "noun",
"terms": [
"test"
],
"entry": [
{
"word": "test",
"reverse_translation": [
"тест",
"испытание",
"анализ",
"проверка",
"критерий",
"контрольная работа"
],
"score": 0.18498141
}
],
"base_form": "контрольная работа",
"pos_enum": 1
}
],
"src": "ru",
"alternative_translations": [
{
"src_phrase": "контрольная работа",
"alternative": [
{
"word_postproc": "test",
"score": 1000,
"has_preceding_space": true,
"attach_to_next_token": false,
"backends": [
10
]
},
{
"word_postproc": "test work",
"score": 0,
"has_preceding_space": true,
"attach_to_next_token": false,
"backends": [
3
]
}
],
"srcunicodeoffsets": [
{
"begin": 0,
"end": 18
}
],
"raw_src_segment": "контрольная работа",
"start_pos": 0,
"end_pos": 0
}
],
"confidence": 1,
"ld_result": {
"srclangs": [
"ru"
],
"srclangs_confidences": [
1
],
"extended_srclangs": [
"ru"
]
},
"target_inflections": [
{
"written_form": "test",
"features": {
"number": 2
}
},
{
"written_form": "tests",
"features": {
"number": 1
}
}
]
}
Requesting the file with WWW in C# (.NET Framework 3.5, in the Unity engine back when WWW was not yet deprecated) or with requests in Python:
{
"sentences": [
{
"trans": "ÐºÐ¾Ð½Ñ‚Ñ € оР»ÑŒÐ½Ð ° Ñ Ñ € Ð ° Ð ± отР°",
"orig": "ÐºÐ¾Ð½Ñ‚Ñ€Ð¾Ð»ÑŒÐ½Ð°Ñ Ñ€Ð°Ð±Ð¾Ñ‚Ð°",
"backend": 3,
"translation_engine_debug_info": [
{
"model_tracking": {
"checkpoint_md5": "ef4a126affdcc2d3c84e987e2d0fb6b1",
"launch_doc": "tea_GermanicB_afdaislbnosvfyyiiw_en_2020q2.md"
}
}
]
}
],
"src": "is",
"alternative_translations": [
{
"src_phrase": "ÐºÐ¾Ð½Ñ‚Ñ€Ð¾Ð»ÑŒÐ½Ð°Ñ Ñ€Ð°Ð±Ð¾Ñ‚Ð°",
"alternative": [
{
"word_postproc": "ÐºÐ¾Ð½Ñ‚Ñ € оР»ÑŒÐ½Ð ° Ñ Ñ € Ð ° Ð ± отР°",
"score": 0,
"has_preceding_space": true,
"attach_to_next_token": false,
"backends": [
3
]
},
{
"word_postproc": "ÐºÐ¾Ð½Ñ‚Ñ € оР»ÑŒÐ½Ð ° Ñ Ñ € Ð ° Ð °",
"score": 0,
"has_preceding_space": true,
"attach_to_next_token": false,
"backends": [
8
]
}
],
"srcunicodeoffsets": [
{
"begin": 0,
"end": 35
}
],
"raw_src_segment": "ÐºÐ¾Ð½Ñ‚Ñ€Ð¾Ð»ÑŒÐ½Ð°Ñ Ñ€Ð°Ð±Ð¾Ñ‚Ð°",
"start_pos": 0,
"end_pos": 0
}
],
"confidence": 1,
"ld_result": {
"srclangs": [
"is"
],
"srclangs_confidences": [
1
],
"extended_srclangs": [
"is"
]
}
}
Because it works directly in Chrome, I added a Chrome User-Agent header, and it works correctly:
import json
import requests
url = 'https://clients5.google.com/translate_a/t'
# Pass the query as a dict so requests builds and percent-encodes the URL.
params = {'client': 'dict-chrome-ex',
          'sl': 'auto',
          'tl': 'en',
          'q': 'контрольная работа'}
# Chrome User-Agent, so the request resembles the one the browser sends.
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
r = requests.get(url, params=params, headers=headers)
jres = r.json()
# ensure_ascii=False prints the Cyrillic text as-is instead of \uXXXX escapes.
print(json.dumps(jres, indent=2, ensure_ascii=False))
Output:
{
"sentences": [
{
"trans": "test",
"orig": "контрольная работа",
"backend": 10
},
{
"src_translit": "kontrol'naya rabota"
}
],
"dict": [
{
"pos": "noun",
"terms": [
"test"
],
"entry": [
{
"word": "test",
"reverse_translation": [
"тест",
"испытание",
"анализ",
"проверка",
"критерий",
"контрольная работа"
],
"score": 0.18498141
}
],
"base_form": "контрольная работа",
"pos_enum": 1
}
],
"src": "ru",
"alternative_translations": [
{
"src_phrase": "контрольная работа",
"alternative": [
{
"word_postproc": "test",
"score": 1000,
"has_preceding_space": true,
"attach_to_next_token": false,
"backends": [
10
]
},
{
"word_postproc": "control work",
"score": 0,
"has_preceding_space": true,
"attach_to_next_token": false,
"backends": [
3
]
}
],
"srcunicodeoffsets": [
{
"begin": 0,
"end": 18
}
],
"raw_src_segment": "контрольная работа",
"start_pos": 0,
"end_pos": 0
}
],
"confidence": 1,
"ld_result": {
"srclangs": [
"ru"
],
"srclangs_confidences": [
1
],
"extended_srclangs": [
"ru"
]
},
"target_inflections": [
{
"written_form": "test",
"features": {
"number": 2
}
},
{
"written_form": "tests",
"features": {
"number": 1
}
}
]
}
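As for why re-encoding the response on the client never helped: the garbled "orig" field in the question looks like the UTF-8 bytes of the query read one byte per character. I can't confirm what the server actually does without the header, so treat the following as a sketch of that assumption only. It uses just the standard library, makes no network call, and the Latin-1 decoding step is my guess at the mistake, not documented behavior of the endpoint.

```python
from urllib.parse import quote, unquote

text = "контрольная работа"

# Percent-encode the query the way an HTTP client does:
# UTF-8 bytes, each written as %XX, with the space as %20.
encoded = quote(text)

# Assumed (hypothetical) mistake on the receiving side:
# the %XX escapes are decoded as Latin-1 instead of UTF-8,
# so every byte becomes its own character.
misdecoded = unquote(encoded, encoding="latin-1")

# Undoing that one step recovers the original text.
repaired = misdecoded.encode("latin-1").decode("utf-8")

print(encoded)
print(len(text), len(misdecoded))
print(repaired == text)
```

The two lengths come out as 18 and 35, which match the "end" values under "srcunicodeoffsets" in the good and the garbled responses. If this is what happens, the service translated the already garbled text (it even detected the source language as "is"), so the "trans" value was computed from the wrong input and no amount of decoding on the client can turn it back into "test". The request itself has to change, which is what the User-Agent header achieves here.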