MongoDB - Unexpected character encodings when using mongoexport
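I'm running mongoexport against a collection that contains non-ASCII characters stored as UTF-8, as well as fields containing characters that mongoexport appears to escape (for example, "&"). What I'm seeing is that mongoexport Unicode-escapes the "&" character (as \u0026) but leaves characters like "ü" unescaped. This is a problem because I'm trying to read the data with Python, and it can't be decoded properly because two different encodings are mixed in the same value.
For example (mongo query to fetch the record):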
db.Military_Handbooks.findOne({_id: ObjectId("5bf61c80e173a2a10b53ad39")}).PRIMARY_AUTHOR
[
    "Dürer, Albrecht",
    [
        [
            "http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=Dürer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order=",
            " Dürer, Albrecht"
        ]
    ]
]
Running the following mongoexport command (the same thing happens when exporting to JSON):
mongoexport --db ustc --collection Military_Handbooks --type=csv -f=PRIMARY_AUTHOR --limit=1
"[""Dürer, Albrecht"",[[""http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\u0026tm_field_allauthr=Dürer, Albrecht\u0026tm_translator=\u0026tm_editor=\u0026tm_field_short_title=\u0026tm_field_imprint=\u0026tm_field_place=\u0026sm_field_year=\u0026f_sm_field_year=\u0026t_sm_field_year=\u0026sm_field_country=\u0026sm_field_lang=\u0026sm_field_format=\u0026sm_field_digital=\u0026tm_field_class=\u0026tm_field_cit_name=\u0026tm_field_cit_no=\u0026order="","" Dürer, Albrecht""]]]"
When trying to read this into Python:
In [24]: import pandas
In [25]: c = pandas.read_csv('Military_Handbooks2.csv')
In [26]: c.at[1, 'PRIMARY_AUTHOR']
Out[26]: '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\u0026tm_field_allauthr=Dürer, Albrecht\u0026tm_translator=\u0026tm_editor=\u0026tm_field_short_title=\u0026tm_field_imprint=\u0026tm_field_place=\u0026sm_field_year=\u0026f_sm_field_year=\u0026t_sm_field_year=\u0026sm_field_country=\u0026sm_field_lang=\u0026sm_field_format=\u0026sm_field_digital=\u0026tm_field_class=\u0026tm_field_cit_name=\u0026tm_field_cit_no=\u0026order="," Dürer, Albrecht"]]]'
In [27]: c.at[1, 'PRIMARY_AUTHOR'].encode().decode('unicode-escape')
Out[27]: '["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=&tm_field_allauthr=Dürer, Albrecht&tm_translator=&tm_editor=&tm_field_short_title=&tm_field_imprint=&tm_field_place=&sm_field_year=&f_sm_field_year=&t_sm_field_year=&sm_field_country=&sm_field_lang=&sm_field_format=&sm_field_digital=&tm_field_class=&tm_field_cit_name=&tm_field_cit_no=&order="," Dürer, Albrecht"]]]'
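One way to see why `unicode-escape` is a poor fit here: that codec treats every byte outside an escape sequence as Latin-1, so the two UTF-8 bytes of "ü" come back as two separate characters. Since the exported cell is itself a JSON array, a sketch like the one below parses it with `json.loads` instead, which resolves `\u0026` and leaves "ü" alone. The sample string is shortened from the output above, and `decode_cell` is a hypothetical helper name. That mongoexport's serializer escapes "&" for HTML safety is an assumption based on the observed output, not something the post confirms.

```python
import json

def decode_cell(cell):
    """Parse one exported CSV cell as JSON; the parser resolves JSON escapes."""
    return json.loads(cell)

# Shortened sample of the cell value as pandas returns it
# (a raw string, so the backslash-u sequence stays literal).
raw = r'["Dürer, Albrecht",[["http://ustc.ac.uk/index.php/search/cicero?tm_fulltext=\u0026order="," Dürer, Albrecht"]]]'

author = decode_cell(raw)
print(author[0])        # author name
print(author[1][0][0])  # URL with the escape resolved

# For comparison: unicode-escape reads the UTF-8 bytes of "ü" as two Latin-1 characters.
mangled = raw.encode().decode('unicode-escape')
print(mangled)
```

With a full DataFrame, the same helper could be applied per column, for instance `c['PRIMARY_AUTHOR'].map(decode_cell)`, assuming every cell in that column holds valid JSON.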
Specs:
OS: Ubuntu 18.04.1 LTS
Python: 3.6.7
MongoDB shell version v3.6.9
In the end, re-encoding the file while ignoring errors seems to have done the trick.
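import codecs
import os

def encoding():
    # Re-encode every file in the current directory, skipping earlier outputs.
    for fn in os.listdir('.'):
        if '2' not in fn and 'failed' not in fn and 'decode' not in fn:
            try:
                with codecs.open(fn, encoding='utf-8') as fd:
                    text = fd.read()
                # Round-trip through Windows-1252, dropping anything that does not convert.
                text = text.encode('Windows-1252', errors='ignore').decode('utf-8', errors='ignore')
                # Write the result next to the original as <name>2.csv.
                with codecs.open(fn[:fn.rfind('.')]+'2.csv', 'w', encoding='utf-8') as fd:
                    fd.write(text)
            except Exception as ex:
                print(ex)
                print('*'*50, '\n')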
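To make the effect of that re-encoding step concrete, here is a small sketch on in-memory strings rather than files. The sample strings and the `reencode` helper name are illustrative assumptions. It suggests that the round trip repairs text whose UTF-8 bytes were earlier misread as Windows-1252, while on text that is already correct the `errors='ignore'` arguments silently drop the accented character, so it is worth checking a few rows before relying on the output.

```python
def reencode(text):
    """Same round trip as in encoding() above, applied to a single string."""
    return text.encode('Windows-1252', errors='ignore').decode('utf-8', errors='ignore')

mojibake = 'DÃ¼rer, Albrecht'   # UTF-8 bytes of "ü" misread as Windows-1252
already_ok = 'Dürer, Albrecht'  # correctly decoded text

repaired = reencode(mojibake)   # the accented character is restored
damaged = reencode(already_ok)  # the accented character is lost
print(repaired)
print(damaged)
```

Note that this round trip leaves `\u0026` escapes untouched; those still need a JSON parser or an equivalent unescaping step.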
I should also note that I came across this post, which was helpful: how to export correctly accented words with mongoexport.