Read .odt and .doc files from a URL in Python

Question

How can I use Python to extract text from '.odt' and '.doc' files located at a URL? I tried searching for this but couldn't find anything.

Any pointers would be helpful.

from odf import text, teletype
from odf.opendocument import load
 
textdoc = load(r"C:\Users\OMS\Downloads\sample1.odt")
allparas = textdoc.getElementsByType(text.P)
for para in allparas:
    print(teletype.extractText(para))

This works for a local .odt file, but now I need to read the file from

"https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"

Assume the connection to AWS S3 has already been set up with boto3.

Answer 1

The following was tested with Python 3.6 and a test .odt file:

import boto3
import io
from odf import text, teletype
from odf.opendocument import load

s3 = boto3.resource('s3')  # TODO: change aws connection logic as per your setup


def get_contents(file_name):
    """Download an .odt object from S3 and load it as an ODF document."""
    obj = s3.Object('s3_bucket_name', file_name)  # TODO: change aws s3 bucket name as per your setup
    # Read the whole object into memory; load() accepts a file-like object.
    body = obj.get()['Body'].read()
    return load(io.BytesIO(body))


textdoc = get_contents("test.odt")  # TODO: change odt file name as per your setup
allparas = textdoc.getElementsByType(text.P)
for para in allparas:
    print(teletype.extractText(para))
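
If the object is publicly readable, or you have a pre-signed URL for it, the same idea should also work over plain HTTPS without boto3. The following is a minimal sketch under that assumption, not something tested against the bucket in the question. The helper names `fetch_bytes` and `extract_odt_paragraphs` are made up for illustration. A private object would still need the boto3 route shown above.

```python
import io
import urllib.request


def fetch_bytes(url, timeout=30):
    """Download the resource at `url` and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_odt_paragraphs(data):
    """Return the text of each paragraph in an .odt file given as bytes."""
    # Imported here so fetch_bytes stays usable without odfpy installed.
    from odf import text, teletype
    from odf.opendocument import load

    # As in the answer above, wrap the bytes in a file-like object for load().
    textdoc = load(io.BytesIO(data))
    return [teletype.extractText(p) for p in textdoc.getElementsByType(text.P)]


# Example usage (the URL is the placeholder from the question):
# data = fetch_bytes("https://abc.s3.ap-south-1.amazonaws.com/sample1.odt")
# for paragraph in extract_odt_paragraphs(data):
#     print(paragraph)
```

Keeping the download separate from the parsing means the bytes can come from either `urllib` or `obj.get()['Body'].read()` without changing the extraction step.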
