wget for HTTPS website from dedicated server

Question

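I recently migrated my site from HTTP to HTTPS. To index all of its pages with the mnogosearch search engine, I need to run a script called "indexer" that ships with mnogosearch; it fetches every page of the site and indexes it into a MySQL table.

This "indexer" script must be run from the machine that hosts the HTTP server, i.e. from a virtual private server (VPS).

The script works fine on the HTTP version of my site, but I have a problem indexing the HTTPS version.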

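To be able to index HTTPS pages, I use the "virtual scheme as an external retrieval system" feature described at this link: [http://www.mnogosearch.org/doc/msearch-extended-indexing.html][1]

It lets an external program fetch the content of HTTPS pages.

Putting the external program into a script named "curl.sh" works: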

#!/bin/sh
wget -r --no-check-certificate

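The problem is that the command "wget -r --no-check-certificate https://example.com/" works on my local machine (it downloads every page of my site "example.com"), but not when I run it directly on the VPS that hosts my HTTPS server (i.e. example.com).

In that second case, only index.html is downloaded.

Here is what I get when I run the recursive wget on the host: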

$ wget -r --no-check-certificate https://example.com/
--2015-09-06 22:22:12--  https://example.com/
Résolution de example.com (example.com)... 
Connexion vers example.com (example.com)...connecté.
Le propriétaire du certificat ne concorde pas avec le nom de l'hôte «example.com»
requête HTTP transmise, en attente de la réponse...200 OK
Longueur: 177 [text/html]
Sauvegarde en : «example.com/index.html»

100%[========================================================================================================================================>] 177         --.-K/s   ds 0s      

2015-09-06 22:22:12 (5,08 MB/s) - «example.com/index.html» sauvegardé [177/177]

FINISHED --2015-09-06 22:22:12--
Total wall clock time: 0,5s
Downloaded: 1 files, 177 in 0s (5,08 MB/s)

And that index.html is not the right page; here is its content:

<html><body><h1>It works!</h1>
<p>This is the default web page for this server.</p>
<p>The web server software is running but no content has been added, yet.</p>
</body></html>

Note that my HTTPS server is reachable on port 8443 (I wrote a rewrite rule that redirects HTTPS requests on port 443 to port 8443).

So I also tried:

wget -r --no-check-certificate https://example.com:8443/

In this case wget apparently tries to fetch all the pages, but I get a 404 error for each page:

$ wget -r --no-check-certificate https://example.com:8443/
--2015-09-06 22:39:03--  https://example.com:8443/
Résolution de example.com (example.com)... 
Connexion vers example.com (example.com)||:8443...connecté.
requête HTTP transmise, en attente de la réponse...303 See Other
Emplacement: index.html [suivant]
--2015-09-06 22:39:04--  https://example.com:8443/index.html
Réutilisation de la connexion existante vers example.com:8443.
requête HTTP transmise, en attente de la réponse...200 OK
Longueur: 7389 (7,2K) [text/html]
Sauvegarde en : «example.com:8443/index.html»

100%[========================================================================================================================================>] 7 389       --.-K/s   ds 0s      

2015-09-06 22:39:04 (145 MB/s) - «example.com:8443/index.html» sauvegardé [7389/7389]

Chargement de robots.txt; svp ignorer les erreurs.
--2015-09-06 22:39:04--  https://example.com:8443/robots.txt
Réutilisation de la connexion existante vers example.com:8443.
requête HTTP transmise, en attente de la réponse...200 OK
Longueur: 138 [text/plain]
Sauvegarde en : «example.com:8443/robots.txt»

100%[========================================================================================================================================>] 138         --.-K/s

UPDATE: I forgot to mention that I have a Twisted Python server behind Apache. This Twisted server listens on port 8443, which is why I redirect from port 443 to port 8443.

Answer 1

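If you have access to the server, the simplest solution is probably to change your Apache configuration so that port 443 uses the same host/virtualhost as port 8443. Then, if you try to download https://example.com/ on the server again, all of your absolute links using https://example.com/ will work too, and you can download everything through the normal port.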
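To check whether both ports really serve the same site from the VPS, it may help to compare the HTTP status lines each URL returns. The following is only a rough sketch: the function name status_lines and the WGET override variable are made up for illustration, and it assumes wget prints the server's status lines on stderr when -S is given.

```shell
#!/bin/sh
# Sketch: print the HTTP status lines returned for each URL given as argument.
# WGET is a hypothetical override hook (defaults to the real wget).
WGET=${WGET:-wget}

status_lines() {
    # -S prints the server's response headers (wget writes them to stderr),
    # so stderr is merged into stdout and the body is discarded.
    # sed keeps only the status lines and strips wget's leading indentation.
    "$WGET" -S --no-check-certificate -O /dev/null "$1" 2>&1 |
        sed -n 's/^ *\(HTTP\/.*\)$/\1/p'
}

for url in "$@"; do
    printf '%s\n' "== $url"
    status_lines "$url"
done
```

Saved under a hypothetical name such as check_ports.sh, it could be run on the VPS as: sh check_ports.sh https://example.com/ https://example.com:8443/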

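Going a step further, I think you may want to drop the -r flag and add -S -O - to your wget command line. It looks like the software you are using expects the server's response headers and body to be written to the console rather than saved to a file.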
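A minimal sketch of what such a wrapper script might look like is shown here. It rests on assumptions: that the indexer passes the URL to fetch as the first argument, and that headers followed by the body on stdout is the expected format. Because wget writes the -S header dump to stderr, this sketch swaps in --save-headers, which puts the headers in front of the body in the output itself. The fetch_page function and the WGET variable are illustrative names, not part of mnogosearch.

```shell
#!/bin/sh
# Sketch of a curl.sh-style wrapper: write response headers and body to stdout.
# Assumption: the indexer passes the URL to fetch as the first argument.

# WGET is a hypothetical override hook so the command can be swapped or stubbed.
WGET=${WGET:-wget}

fetch_page() {
    # -q                      suppress wget's progress and log messages
    # --save-headers          prepend the HTTP response headers to the output
    # --no-check-certificate  skip certificate validation, as in the question
    # -O -                    write the result to stdout instead of a file
    "$WGET" -q --save-headers --no-check-certificate -O - "$1"
}

# Fetch only when a URL was supplied.
if [ "$#" -gt 0 ]; then
    fetch_page "$1"
fi
```

Without -r, each invocation fetches exactly one page, which matches the idea of the indexer requesting pages one at a time and following links itself.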

https

curl

search-engine

wget