How to get the exact page content in wget if the error code is 404

Question

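I have two URLs: one is a working URL and the other is the URL of a deleted page. The working URL is fine, but for the deleted-page URL wget receives a 404 and I do not get the actual page content.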

Working URL

import os
def curl(url):
    data = os.popen('wget -qO- %s '% url).read()
    print (url)
    print (len(data))
    #print (data)

curl("https://www.reverbnation.com/artist_41/bio")

Output:

https://www.reverbnation.com/artist_41/bio
80067

Deleted-page URL

import os
def curl(url):
    data = os.popen('wget -qO- %s '% url).read()
    print (url)
    print (len(data))
    #print (data)

curl("https://www.reverbnation.com/artist_42/bio")

Output:

https://www.reverbnation.com/artist_42/bio
0

I get a length of 0, but the live page does show some content.

How can I get the exact content of that URL with wget or curl?

Answer 1

wget has a switch called "--content-on-error":

--content-on-error
           If this is set to on, wget will not skip the content when the server responds with a http status code that indicates error.

So just add it to your code and you will also get the "content" of the 404 page:

import os
def curl(url):
    data = os.popen('wget --content-on-error -qO- %s '% url).read()
    print (url)
    print (len(data))
    #print (data)

curl("https://www.reverbnation.com/artist_42/bio")
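As a variation on the snippet above, here is a minimal sketch that calls wget through subprocess with an argument list instead of os.popen, so characters such as & or ? in the URL are not interpreted by the shell. The helper names build_wget_command and fetch are illustrative and not part of the original answer, and the sketch assumes a wget build that supports --content-on-error.

```python
import subprocess


def build_wget_command(url):
    """Return the wget argument list (hypothetical helper).

    Passing a list means no shell is involved, so the URL needs no quoting.
    """
    return ["wget", "--content-on-error", "-q", "-O", "-", url]


def fetch(url):
    """Run wget and return (exit_code, body_bytes) (hypothetical helper)."""
    # check=False: wget is expected to exit non-zero on an HTTP error
    # response even when --content-on-error makes it print the body.
    result = subprocess.run(
        build_wget_command(url),
        stdout=subprocess.PIPE,
        check=False,
    )
    return result.returncode, result.stdout


# Example call (needs network access and wget on PATH):
# code, body = fetch("https://www.reverbnation.com/artist_42/bio")
# print(code, len(body))
```

Returning the exit status alongside the body lets the caller tell a normal page from an error page; wget's documentation lists exit status 8 for a server error response, so a non-zero code with a non-empty body is the expected result for the deleted page.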

Tags: curl, wget, web-scraping, python-3.x