
Problem with PDF download that I cannot open

I am writing a script that uses https://case.law/docs/site_features/api to extract text from legal cases. I have already written methods for search and create-xlsx, and they work well, but I am struggling with the method that opens the online pdf link, writes it to a temporary file ('wb'), reads it back and extracts the data (the core text), and then closes it. The end goal is to use the content of these cases for NLP.
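
For reference, the overall flow I am aiming for would look roughly like the sketch below. This is only an illustration, not my current script: the pypdf dependency and the pdf_url_to_text helper are assumptions, and the URL is a placeholder.

import tempfile

import requests
from pypdf import PdfReader  # assumed dependency, for illustration only

def pdf_url_to_text(pdf_url):
    # Download the PDF bytes, write them to a temporary file, read the file
    # back and return the extracted text. pdf_url is a placeholder.
    response = requests.get(pdf_url)
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        tmp.write(response.content)
        tmp_path = tmp.name
    reader = PdfReader(tmp_path)
    return '\n'.join(page.extract_text() or '' for page in reader.pages)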

I wrote a function to download the file (see below):

import urllib3

def download_file(file_id):
    http = urllib3.PoolManager()
    folder_path = "path_to_my_desktop"                  # placeholder path
    file_download = "https://cite.case.law/xxxxxx.pdf"  # placeholder URL
    # request() preloads the response body, so the raw bytes are in .data
    file_content = http.request('GET', file_download)
    with open(folder_path + file_id + '.pdf', 'wb') as file_local:
        file_local.write(file_content.data)

The script runs fine as far as downloading the file and creating it on my desktop; however, when I try to open the file manually on the desktop, I get the following message from Acrobat Reader:

Adobe Acrobat Reader could not open 'file_id.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded).
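
One quick diagnostic (not part of my original script, the path is a placeholder) is to check whether the downloaded file actually starts with the PDF magic bytes; if it does not, the problem is in what the server sent rather than in Acrobat:

# A real PDF starts with the bytes b'%PDF-'; anything else means the server
# returned something other than a PDF (for example an HTML page).
with open('path_to_my_desktop/file_id.pdf', 'rb') as f:
    header = f.read(5)
print('Looks like a PDF' if header == b'%PDF-' else 'Not a PDF, header is %r' % header)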

I thought the problem was the library, so I tried Requests / xlsxwriter / urllib3... (examples below). I also tried reading the file back from within the script to see whether the problem was on Adobe's side, but apparently it isn't.

import re
import requests

# Download the pdf from the search results
URL = "https://cite.case.law/xxxxxx.pdf"   # placeholder URL
r = requests.get(URL, stream=True)
# path_to_desktop and pdf_name are placeholders
with open(path_to_desktop + pdf_name + '.pdf', 'w') as f:
    f.write(r.text)

# open the downloaded file and remove '<[^<]+?>' for easier reading
with open('C:/Users/amallet/Desktop/r.pdf', 'r') as ff:
    data_read = ff.read()
    stripped = re.sub('<[^<]+?>', '', data_read)
    print(stripped)

The output is:

document.getElementById('next').value = document.location.toString();
document.getElementById('not-a-bot-form').submit();

Using 'wb' and 'rb' instead (and removing the tag-stripping step), the script is:

r = requests.get(test_case_pdf, stream=True)
with open('C:/Users/amallet/Desktop/r.pdf', 'wb') as f:
    f.write(r.content)

with open('C:/Users/amallet/Desktop/r.pdf', 'rb') as ff:
    data_read = ff.read()
    print(data_read)

The output is:

<html>
<head>
<noscript>
<meta http-equiv="Refresh" content="0;URL=?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" />
</noscript>
</head>
<body>
<form method="post" id="not-a-bot-form">
<input type="hidden" name="csrfmiddlewaretoken" value="5awGW0F4A1b7Y6bxrYBaA6GIvqx4Tf6DnK0qEMLVoJBLoA3ZqOrpMZdUXDQ7ehOz">
<input type="hidden" name="not_a_bot" value="yes">
<input type="hidden" name="next" value="/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf" id="next">
</form>
<script>
document.getElementById('next').value = document.location.toString();
document.getElementById('not-a-bot-form').submit();
</script>
<a href="?no_js=1&next=/pdf/7840543/In%20re%20the%20Extradition%20of%20Garcia,%20890%20F.%20Supp.%20914%20(1994).pdf">Click here to continue</a>
</body>
</html>

But none of this is working. The pdf is not password-protected, and I have also tried other websites, without success.

So I am wondering whether there is another issue that is not the link or the code itself.

Please let me know if you need more information.

Thanks

It looks like the web server is giving you not a PDF but a web page designed to prevent bots from downloading data from the site.

There is nothing wrong with your code, but if you still want to do this, you will have to work around the site's bot prevention.
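
Based only on the HTML dump in your question, the interstitial page offers a no-JS fallback link ("Click here to continue"). Below is a rough sketch of following that link inside one cookie-preserving session; whether this actually satisfies the site's check is an assumption on my part, and the documented API at https://case.law/docs/site_features/api is the safer route for getting case text in bulk.

import re
from urllib.parse import urljoin

import requests

def fetch_pdf(pdf_url, out_path):
    # Keep cookies across both requests, since the bot check may rely on them.
    session = requests.Session()
    first = session.get(pdf_url)

    if first.content.startswith(b'%PDF-'):
        body = first.content  # no interstitial page this time
    else:
        # Pull the "?no_js=1&next=..." fallback link out of the interstitial HTML.
        match = re.search(r'href="(\?no_js=1&(?:amp;)?next=[^"]+)"', first.text)
        if not match:
            raise RuntimeError('Bot-check page did not contain the expected link')
        link = match.group(1).replace('&amp;', '&')
        second = session.get(urljoin(first.url, link))
        body = second.content

    if not body.startswith(b'%PDF-'):
        raise RuntimeError('Still not a PDF; the bot check was not satisfied')

    with open(out_path, 'wb') as f:
        f.write(body)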