Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python)

I am trying to query the Adobe PDF Services API to generate (export) a DOCX from a PDF document.

I just wrote some Python code to generate a Bearer token so that Adobe PDF Services can identify me (see the question here: ). Then I wrote the following piece of code, in which I tried to follow the instructions on this page about the EXPORT option of Adobe PDF Services (here: https://documentcloud.adobe.com/document-services/index.html#post-exportPDF).

Here is the code:

import requests
import json
from requests.structures import CaseInsensitiveDict
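# N.B.: the part of the code that generates the token and authenticates with the server is omitted
# This part is the POST request that uploads my PDF file through form parameters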
URL = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%257B%2522reltype%2522%253A%2520%2522http%253A%252F%252Fns.adobe.com%252Frel%252Fprimary%2522%257D"

headers = CaseInsensitiveDict()
headers["x-api-key"] = "client_id"
headers["Authorization"] = "Bearer MYREALLYLONGTOKENIGOT"
headers["Content-Type"] = "application/json"

myfile = {"file":open("absolute_path_to_the_pdf_file/input.pdf", "rb")}

j="""
{
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
  },
  "cpf:inputs": {
    "params": {
      "cpf:inline": {
        "targetFormat": "docx"
      }
    },
    "documentIn": {
      "dc:format": "application/pdf",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/trs_pdf_file_copy.pdf"
    }
  },
  "cpf:outputs": {
    "documentOut": {
      "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/output.docx"
    }
  }
}"""

resp = requests.post(url=URL, headers=headers, json=json.dumps(j), files=myfile)
   

print(resp.text)
print(resp.status_code)

The status code is 400. The server authenticates me correctly, but print(resp.text) gives the following result:

{"requestId":"the_request_id","type":"Bad Request","title":"Not a multipart request. Aborting.","status":400,"report":"{\"error_code\":\"INVALID_MULTIPART_REQUEST\"}"}

I think I am misunderstanding the "form parameters" of the POST method for the export job in Adobe's API guide (https://documentcloud.adobe.com/document-services/index.html).

Do you have any ideas for improvement? Thanks!

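First, make the variable j a Python dict and then create a JSON string from it. What is also not very clear from Adobe's documentation is that the value of documentIn.cpf:location needs to be the same as the key used for your file. I have corrected this in your script to InputFile0. I also guessed that you want to save your file, so I added that as well.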

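import requests
import json
import time

# Placeholders: replace with the token and client ID from your own credentials
token = "MYREALLYLONGTOKENIGOT"
client_id = "client_id"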

URL = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%257B%2522reltype%2522%253A%2520%2522http%253A%252F%252Fns.adobe.com%252Frel%252Fprimary%2522%257D"

headers = {
    'Authorization': f'Bearer {token}',
    'Accept': 'application/json, text/plain, */*',
    'x-api-key': client_id,
    'Prefer': "respond-async,wait=0",
}

myfile = {"InputFile0":open("absolute_path_to_the_pdf_file/input.pdf", "rb")}

j={
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
  },
  "cpf:inputs": {
    "params": {
      "cpf:inline": {
        "targetFormat": "docx"
      }
    },
    "documentIn": {
      "dc:format": "application/pdf",
      "cpf:location": "InputFile0"
    }
  },
  "cpf:outputs": {
    "documentOut": {
      "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/output.docx"
    }
  }
}

body = {"contentAnalyzerRequests": json.dumps(j)}

resp = requests.post(url=URL, headers=headers, data=body, files=myfile)
   

print(resp.text)
print(resp.status_code)

poll = True
while poll:
    new_request = requests.get(resp.headers['location'], headers=headers)
    if new_request.status_code == 200:
        open('test.docx', 'wb').write(new_request.content)
        poll = False
    else:
        time.sleep(5)

I don't know why the DOCX file (it is created fine, by the way) doesn't open; a popup says that the content is not readable. Maybe it's due to the 'wb' write mode.

I had the same problem. Casting the request content to 'bytes' solved it.

poll = True
while poll:
    new_request = requests.get(resp.headers['location'], headers=headers)
    if new_request.status_code == 200:
        # The context manager closes the file as soon as the write finishes
        with open('test.docx', 'wb') as f:
            f.write(bytes(new_request.content))
        poll = False
    else:
        time.sleep(5)
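
If the saved file still does not open, a bounded version of the polling loop with a quick sanity check can narrow things down. The sketch below is not part of Adobe's API: poll_and_save and its fetch argument are hypothetical names, and fetch is assumed to return a (status_code, content) pair, for example by wrapping requests.get(resp.headers['location'], headers=headers). Since a DOCX file is a ZIP container, zipfile.is_zipfile gives a cheap way to tell whether what was written is plausibly a Word document or some other payload.

```python
import time
import zipfile
from typing import Callable, Tuple


def poll_and_save(fetch: Callable[[], Tuple[int, bytes]], out_path: str,
                  max_attempts: int = 60, delay: float = 5.0) -> bool:
    """Call fetch() until it reports HTTP 200, then write the body to out_path.

    fetch is a hypothetical callable returning (status_code, content).
    Returns True only if the saved file is a ZIP container, as a DOCX must be.
    """
    for _ in range(max_attempts):
        status, content = fetch()
        if status == 200:
            # The context manager flushes and closes the file before the check.
            with open(out_path, "wb") as f:
                f.write(content)
            return zipfile.is_zipfile(out_path)
        time.sleep(delay)
    # Give up instead of looping forever when the job never completes.
    raise TimeoutError(f"no result after {max_attempts} attempts")
```

With requests, this could be called as poll_and_save(lambda: (lambda r: (r.status_code, r.content))(requests.get(resp.headers['location'], headers=headers)), 'test.docx'). A False return value would mean the server sent something other than a bare DOCX, which is worth inspecting before blaming the write mode.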