download_as_text from Google Cloud Storage Blob / Object causes UnicodeDecodeError

Question

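I am trying to get human-readable text from a PDF stored as a Cloud Storage Blob / Object. The documentation tells me that the download_as_string() method is deprecated in favor of download_as_bytes(), which downloads the contents of the blob as a bytes object.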

json_string = blob_list[0].download_as_bytes() 
print(json_string)

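When I run the code above, the blob's contents are downloaded as a bytes object, but that is not human-readable and not what I am looking for.

Next I tried both download_as_text() and download_as_text().decode(), but both raised the following error: return data.decode("utf-8") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

My last attempt was download_as_bytes().decode('ISO-8859-1'), which does not raise an error but does not return human-readable text either.

What am I doing wrong? How can I get the text from a Cloud Storage Blob / Object?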

Answer 1

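PDF files consist of binary data, not text. This means they cannot be represented as a Unicode string in any meaningful way, which is why decoding the downloaded bytes either fails or produces unreadable output. The Google Cloud Vision API with vision.Feature.Type.DOCUMENT_TEXT_DETECTION can be used to get the text from a PDF, but for simply reading a PDF, Cloud Vision is overkill.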
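To illustrate why both decoding attempts in the question behave the way they do, here is a minimal sketch using only the standard library. The sample bytes are an assumption: they imitate a commonly seen PDF header (a version line followed by a comment line of high-bit bytes), and the exact bytes vary between PDF producers. The helper name try_decode is made up for this example.

```python
# Assumed sample: the first bytes of a typical PDF. The second line is a
# comment made of high-bit bytes that many producers write to mark the
# file as binary. Real files may differ.
pdf_head = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"failed at byte {exc.start}: {exc.reason}"


# UTF-8 rejects the data: 0xe2 starts a multi-byte sequence, but the byte
# that follows it is not a valid continuation byte.
utf8_result = try_decode(pdf_head, "utf-8")

# ISO-8859-1 maps every possible byte to some character, so it never
# raises, but the result is not meaningful text.
latin1_result = try_decode(pdf_head, "ISO-8859-1")

print(utf8_result)
print(repr(latin1_result))
```

With this sample, the UTF-8 failure lands on the same byte value and position as the error quoted in the question, which suggests the error comes from the PDF header itself rather than from anything specific to Cloud Storage.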

Programs can read and process PDFs because they have a structured format. There are also many libraries that can read and interpret PDF files.
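As one possible way to apply this, the sketch below downloads the blob with download_as_bytes() and hands the bytes to a PDF library. It assumes the third-party packages google-cloud-storage and pypdf are installed; pypdf is only one of several libraries that could be used here. The function names and the bucket and blob arguments are hypothetical.

```python
import io


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files start with the marker %PDF-."""
    return data.startswith(b"%PDF-")


def extract_pdf_text(pdf_bytes: bytes) -> str:
    """Extract the text layer from PDF bytes, one page after another."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # here so that the helpers above work without it.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can return an empty result for pages without a
    # text layer, so fall back to an empty string.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_pdf_blob(bucket_name: str, blob_name: str) -> str:
    """Download a PDF blob from Cloud Storage and return its text."""
    # Assumption: google-cloud-storage is installed and credentials are
    # configured in the environment.
    from google.cloud import storage

    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    data = blob.download_as_bytes()  # raw bytes, not decoded
    if not looks_like_pdf(data):
        raise ValueError(f"{blob_name} does not look like a PDF")
    return extract_pdf_text(data)
```

A call such as read_pdf_blob("my-bucket", "report.pdf") would then return the extracted text. A scanned PDF that contains only images has no text layer to extract, and that is the case where an OCR service such as Cloud Vision becomes relevant.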


decode

character-encoding

google-cloud-storage

google-cloud-platform

google-cloud-functions