UnicodeEncodeError while transferring ".eml" file to Google Cloud Platform (gsutil v4.6.1 on Linux)

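When I use the gsutil cp command to transfer files from a Linux system to Google Cloud Platform, it fails on certain old ".eml" files whose content (not just the file name!) contains non-English characters that are not encoded in Unicode.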

The command I ran was:

gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/

The error message is:

UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)

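gsutil rsync gives a very similar error. Position 22881 (0x5961) turns out to point near the end of the multipart email source file. A hex dump of the file content at that location: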

00005960: 20a8 43a4 d1b3 a320 5961 686f 6f21 a95f   .C.... Yahoo!._
00005970: bcaf 203e 2020 7777 772e 7961 686f 6f2e  .. >  www.yahoo.
00005980: 636f 6d2e 7477 0d0a                      com.tw..

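At position 0x5961 we see the byte 0xa8, which is the source of the problem the error message points to. For some reason gsutil tries to encode the text. Opening the file in a terminal that supports Chinese characters shows: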

< 每天都 Yahoo!奇摩 >  www.yahoo.com.tw
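The character '\udca8' in the error message is consistent with Python's surrogateescape error handler, which maps an undecodable byte 0xNN to the lone surrogate U+DCNN. A minimal sketch, assuming only the bytes copied from the hex dump above (offsets 0x5960 to 0x5971), that ties the byte 0xa8 to both the terminal rendering and the error:

```python
# Bytes at offsets 0x5960-0x5971, copied from the hex dump.
raw = bytes.fromhex("20a843a4d1b3a3205961686f6f21a95fbcaf")

# Decoded as Big5, the bytes give the text the terminal displays.
as_big5 = raw.decode("big5")
print(as_big5)

# Decoded as ASCII with surrogateescape, each non-ASCII byte 0xNN
# becomes the lone surrogate U+DCNN, so 0xa8 turns into '\udca8'.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text[1]))

# A strict ASCII encode of that string fails on the surrogate,
# with the same kind of error as in the gsutil traceback.
try:
    text.encode("ascii")
    strict_encode_failed = False
except UnicodeEncodeError as exc:
    strict_encode_failed = True
    print(exc)

# Encoding with surrogateescape restores the original bytes unchanged.
round_trip = text.encode("ascii", errors="surrogateescape")
```

Nothing here calls gsutil; it only shows that a byte-preserving decode followed by a strict ASCII encode is enough to produce this error.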

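The first Chinese character, "每", is encoded in Big-5 as 0xa843. A simple workaround is to rename the file so that its extension is something other than ".eml", for example ".eml.bak"; gsutil then does not process the file content. Unfortunately, in a bulk transfer it is hard to know in advance which files contain such non-English characters, so the whole process can stop many times.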
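One way to find candidate files before a bulk transfer is to scan for ".eml" files containing any byte outside the ASCII range. This is only a sketch: `find_non_ascii_eml` is a hypothetical helper, not part of gsutil, and it assumes that any raw non-ASCII byte in a ".eml" file can trigger the failure, which may flag more files than actually fail.

```python
import re
from pathlib import Path

# Matches the first byte outside the 7-bit ASCII range.
NON_ASCII = re.compile(rb"[\x80-\xff]")


def find_non_ascii_eml(root):
    """Yield (path, offset) for each .eml file under root that
    contains a non-ASCII byte; offset is the first such byte."""
    for path in sorted(Path(root).rglob("*.eml")):
        match = NON_ASCII.search(path.read_bytes())
        if match:
            yield path, match.start()
```

Calling `list(find_non_ascii_eml("/home/darsenlu/Home/mail"))` would list the files to rename or handle separately. Messages whose bodies are fully base64 or quoted-printable encoded contain only ASCII bytes and are not reported.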

Here is the full error output:

darsenlu@devmodel:~/Home$ gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
Copying file:///home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml [Content-Type=message/rfc822]...
Traceback (most recent call last):
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
    gsutil.RunMain()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil.py", line 122, in RunMain
    sys.exit(gslib.__main__.main())
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 444, in main
    user_project=user_project)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 780, in _RunNamedCommandAndHandleExceptions
    _HandleUnknownFailure(e)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 639, in _RunNamedCommandAndHandleExceptions
    user_project=user_project)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 411, in RunNamedCommand
    return_code = command_inst.RunCommand()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 1124, in RunCommand
    seek_ahead_iterator=seek_ahead_iterator)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1525, in Apply
    arg_checker, should_return_results, fail_on_error)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1596, in _SequentialApply
    worker_thread.PerformTask(task, self)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 2316, in PerformTask
    results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 709, in _CopyFuncWrapper
    preserve_posix=cls.preserve_posix_attrs)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 924, in CopyFunc
    preserve_posix=preserve_posix)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 3957, in PerformCopy
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2250, in _UploadFileToObject
    parallel_composite_upload, logger)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2066, in _DelegateUploadFileToObject
    elapsed_time, uploaded_object = upload_delegate()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2227, in CallNonResumableUpload
    gzip_encoded=gzip_encoded_file)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 1762, in _UploadFileToObjectNonResumable
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 388, in UploadObject
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1712, in UploadObject
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1534, in _UploadObject
    global_params=global_params)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/third_party/storage_apitools/storage_v1_client.py", line 1182, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 703, in _RunMethod
    download)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 679, in PrepareHttpRequest
    upload.ConfigureRequest(upload_config, http_request, url_builder)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 763, in ConfigureRequest
    self.__ConfigureMultipartRequest(http_request)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 823, in __ConfigureMultipartRequest
    g.flatten(msg_root, unixfrom=False)
  File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.6/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.6/email/generator.py", line 272, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.6/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.6/email/generator.py", line 361, in _handle_message
    payload = self._encode(payload)
  File "/usr/lib/python3.6/email/generator.py", line 412, in _encode
    return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)
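The last frames of the traceback suggest the failure happens in Python's standard email.generator module: the multipart upload body is flattened, the part typed message/rfc822 is dispatched to _handle_message, and its payload is then strictly encoded as ASCII. The sketch below imitates that sequence with the standard library only. `flatten_upload` is a hypothetical helper modelled on the frames above, not the actual gsutil or apitools code, and I have only reasoned about it against the standard library as I know it.

```python
import io
from email.generator import BytesGenerator
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart


def flatten_upload(body, mime_type):
    """Wrap body (bytes) in a multipart/related message and flatten
    it to bytes, roughly as the traceback shows for a multipart upload."""
    root = MIMEMultipart("related")
    part = MIMENonMultipart(*mime_type.split("/"))
    part["Content-Transfer-Encoding"] = "binary"
    # set_payload() stores bytes as str, decoded with surrogateescape.
    part.set_payload(body)
    root.attach(part)
    buf = io.BytesIO()
    BytesGenerator(buf, mangle_from_=False).flatten(root, unixfrom=False)
    return buf.getvalue()


# The same Big5 bytes as in the hex dump.
body = bytes.fromhex("20a843a4d1b3a3205961686f6f21a95fbcaf")

# A generic binary type goes through the default body handler.
generic = flatten_upload(body, "application/octet-stream")

# message/* parts take the _handle_message path seen in the traceback.
try:
    flatten_upload(body, "message/rfc822")
    rfc822_error = None
except UnicodeEncodeError as exc:
    rfc822_error = exc
print(rfc822_error)
```

If this matches what gsutil does, it would also explain why renaming the file helps: with a different extension the upload is no longer labelled message/rfc822, so the message-specific handler is never reached.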

The Linux system is Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-76-generic x86_64).

I took your Chinese string and was able to reproduce your error. Updating to gsutil 4.62 fixed it for me. Here are the merged PR and issue tracker for reference.

Update the Cloud SDK by running:

gcloud components update