Use wget to download a PDF with no direct link

Some websites serve PDF files for viewing, but I cannot download such PDFs with wget. Opening the following page in my browser displays a PDF: https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/

With the command below, however, I only get a zero-length PDF file.

wget --content-disposition -nd https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/

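I tried several combinations of saving and loading cookies and setting the referer, but nothing worked. At this point I am mostly curious about what is going on, and why wget retrieves nothing except possibly an empty index.html.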

When I looked at the server response, wget reported a content length of 0.

--2021-04-17 14:59:35--  https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021/
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Sat, 17 Apr 2021 13:59:36 GMT
  Server: Apache
  Set-Cookie: fe_typo_user=477e8a1d2b3dd74bc5b6b408a6d74edd; expires=Mon, 17-May-2021 13:59:36 GMT; Max-Age=2592000; path=/; domain=.lokalmatador.de; httponly; samesite=lax
  Upgrade: h2,h2c
  Connection: Upgrade, Keep-Alive
  Content-Length: Array
  Cache-Control: max-age=2592000
  Expires: Mon, 17 May 2021 13:59:36 GMT
  X-UA-Compatible: IE=edge
  X-Content-Type-Options: nosniff
  Keep-Alive: timeout=5, max=100
  Content-Type: application/pdf
Length: 0 [application/pdf]
Remote file exists but does not contain any link -- not retrieving.

So I read the manual:

https://www.gnu.org/software/wget/manual/html_node/HTTP-Options.html

and there is an option meant for exactly this case:

‘--ignore-length’

    Unfortunately, some HTTP servers (CGI programs, to be more precise) send out bogus Content-Length headers, which makes Wget go wild, as it thinks not all the document was retrieved. You can spot this syndrome if Wget retries getting the same document again and again, each time claiming that the (otherwise normal) connection has closed on the very same byte.

    With this option, Wget will ignore the Content-Length header—as if it never existed. 
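The response headers shown earlier contain `Content-Length: Array`, which is not a number, so this looks like the kind of bogus header the manual describes. As a rough sketch, the check below reads a saved header dump on standard input and reports whether the Content-Length value is a plain non-negative integer. The function name `check_content_length` is a hypothetical helper for illustration, not part of wget.

```shell
# Hypothetical helper (not part of wget): reads HTTP response headers on
# stdin and classifies the first Content-Length header it finds.
check_content_length() {
  # Header names are case-insensitive, so compare in lower case.
  # Strip the header name, then any trailing whitespace or carriage return.
  value=$(awk 'tolower($0) ~ /^[ \t]*content-length:/ {
    sub(/^[^:]*:[ \t]*/, "")
    sub(/[ \t\r]*$/, "")
    print
    exit
  }')
  case "$value" in
    '')       echo "missing" ;;        # no Content-Length header at all
    *[!0-9]*) echo "bogus: $value" ;;  # contains a non-digit character
    *)        echo "ok: $value" ;;     # digits only
  esac
}

# The header line from the log above:
printf '  Content-Length: Array\n  Content-Type: application/pdf\n' | check_content_length
# prints "bogus: Array"
```

A result of "bogus" or "missing" suggests that trying `--ignore-length` is reasonable; it does not prove that the body is intact.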

With that option, the wget command works as expected:

 wget --ignore-length -O epaper.pdf https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021

This is the output I see with --ignore-length:

--2021-04-17 14:56:19--  https://www.lokalmatador.de/epaper/ausgabe/gemeinderundschau-muehlhausen-14-2021
Resolving www.lokalmatador.de (www.lokalmatador.de)... 37.202.6.70
Connecting to www.lokalmatador.de (www.lokalmatador.de)|37.202.6.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [application/pdf]
Saving to: ‘epaper.pdf’

epaper.pdf                        [                  <=>                             ]   4.39M  1.23MB/s    in 3.6s

2021-04-17 14:56:23 (1.21 MB/s) - ‘epaper.pdf’ saved [4601842]
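Because the length check is now disabled, it may be worth confirming that the saved file really is a PDF. A minimal sketch, assuming the file name `epaper.pdf` from the command above; `is_pdf` is a hypothetical helper that only checks for the `%PDF-` signature at the start of the file and does not validate the rest of the document.

```shell
# Hypothetical helper: succeeds if the file starts with the PDF signature.
is_pdf() {
  # PDF files begin with the five bytes "%PDF-".
  # tr drops NUL bytes so the shell does not warn on binary input.
  [ "$(head -c 5 "$1" 2>/dev/null | tr -d '\0')" = "%PDF-" ]
}

if is_pdf epaper.pdf; then
  echo "epaper.pdf looks like a PDF"
else
  echo "epaper.pdf does not look like a PDF"
fi
```

An empty or missing file fails the check, which would catch the zero-length result from the first attempt.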