
Is there a way to ignore a 302 Moved Temporarily redirect, or to find out what causes it?

I am writing a parsing script and need to access many web pages like this one.

Whenever I try to fetch such a page with urlopen and then read(), I get redirected to another page.

When I open the same link in Google Chrome, the redirect also happens, but only rarely; most of the time it occurs when I open the URL directly rather than by clicking it in the site menu.

Is there a way in python3 to avoid the redirect, or to simulate navigating to the URL from the site menu?

Sample code:

from urllib.request import urlopen
import re

def getItemsFromPage(url):
    with urlopen(url) as page:
        html_doc = str(page.read())
    return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
with urlopen(item_urls[0]) as item_page:
    print(item_page.read().decode('utf-8'))  # Here I get search.advanced instead of the item page
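
For debugging, one way to see the 302 response itself, including its Location header, is to issue a single request with the standard library's http.client, which does not follow redirects on its own (a minimal sketch, not part of the original code):

import http.client
from urllib.parse import urlsplit

def show_redirect(url):
    # send one GET and print the status line and redirect target, without following it
    parts = urlsplit(url)
    path = parts.path + ('?' + parts.query if parts.query else '')
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request('GET', path)
    resp = conn.getresponse()
    print(resp.status, resp.reason)        # e.g. 302 Moved Temporarily
    print(resp.getheader('Location'))      # where the server wants to send you
    conn.close()

show_redirect('http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1')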

Your problem is nothing but the &amp; in the URL string: the raw HTML escapes each ampersand as &amp;, so you need to replace &amp; with & before requesting the page. I rewrote your code using urllib3 as shown below and got the expected web pages.

import re
import urllib3

def getItemsFromPage(url):
    # create connection pool object (urllib3-specific)
    localpool = urllib3.PoolManager()
    with localpool.request('GET', url) as page:
        html_doc = page.data.decode('utf-8')
    return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

# the master webpage
url_master = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
# name and store the downloaded contents for testing purpose.
file_folder = "R:"
file_mainname = "test"

# parse the master webpage
items_urls = getItemsFromPage(url_master)

# create pool
mypool = urllib3.PoolManager()

for i, url in enumerate(items_urls):
    # file name to be saved
    file_name = file_folder + "\\" + file_mainname + str(i) + ".htm"
    # replace '&amp;' with '&' so the server sees the real query string
    url_OK = re.sub(r'&amp;', r'&', url)
    # print revised url
    print(url_OK)
    ### the urllib3-pythonic way of web page retrieval ###
    with mypool.request('GET', url_OK) as page, open(file_name, 'w') as f:
        f.write(page.data.decode('utf-8'))

(Verified on Python 3.4, Eclipse PyDev, Win7 x64)
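
As a quick sanity check (a hypothetical snippet, not part of the verified code above), printing one extracted link before the substitution confirms that it still carries the raw &amp; entity from the HTML source:

items_urls = getItemsFromPage(url_master)
# the raw HTML escapes ampersands, so the extracted link still contains &amp;
print(items_urls[0])
# e.g. http://www.charitynavigator.org/index.cfm?bay=search.summary&amp;orgid=...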

In fact, the ampersand in the raw HTML data is the odd part here. When you visit the web page and click a link, the browser reads the escaped ampersand (&amp;) as & and everything works. Python, however, reads the data as-is, that is, as raw data. So:

import urllib.request as net
import html
import re

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
}

def unescape(items):
    # html.unescape() turns entities such as &amp; back into & (available since Python 3.4)
    unescaped = []
    for i in items:
        unescaped.append(html.unescape(i))

    return unescaped


def getItemsFromPage(url):
    request = net.Request(url, headers=headers)
    response = net.urlopen(request).read().decode('utf-8')
    # --------------------------
    # FIX AMPERSANDS - unescape
    # --------------------------
    links = re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', response)
    unescaped_links = unescape(links)

    return unescaped_links


url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
request = net.Request(item_urls[0], headers=headers)
print(item_urls)
response = net.urlopen(request)

# DEBUG RESPONSE 
print(response.url)
print(80 * '-')

print("<title>Charity Navigator Rating - 10,000 Degrees</title>" in (response.read().decode('utf-8')))