
Is there a way to ignore a 302 Moved Temporarily redirect, or to find out what causes it?

I am writing a parsing script and need to access many web pages like this one.

Whenever I try to fetch such a page with urlopen and then read(), I get redirected to another page.

When I open the same link in Google Chrome, the redirect also happens, but only rarely; most of the time it occurs when I open the URL directly rather than by clicking it in the site menu.

Is there a way in python3 to avoid the redirect, or to simulate navigating to the URL from the site menu?

Sample code:

from urllib.request import urlopen
import re

def getItemsFromPage(url):
    with urlopen(url) as page:
        html_doc = str(page.read())
    return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
with urlopen(item_urls[0]) as item_page:
    print(item_page.read().decode('utf-8'))  # Here I get search.advanced instead of the item page
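
For debugging, one way to see the 302 response itself, including its Location header, is to issue a single request with the standard library's http.client, which does not follow redirects on its own (a minimal sketch, not part of the original code):

import http.client
from urllib.parse import urlsplit

def show_redirect(url):
    # send one GET and print the status line and redirect target, without following it
    parts = urlsplit(url)
    path = parts.path + ('?' + parts.query if parts.query else '')
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request('GET', path)
    resp = conn.getresponse()
    print(resp.status, resp.reason)        # e.g. 302 Moved Temporarily
    print(resp.getheader('Location'))      # where the server wants to send you
    conn.close()

show_redirect('http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1')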

Your problem is nothing but the &amp; in the URL string: the raw HTML escapes each ampersand as &amp;, so you need to replace &amp; with & before requesting the page. I rewrote your code using urllib3 as shown below and got the expected web pages.

import re
import urllib3

def getItemsFromPage(url):
    # create connection pool object (urllib3-specific)
    localpool = urllib3.PoolManager()
    with localpool.request('GET', url) as page:
        html_doc = page.data.decode('utf-8')
    return re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', html_doc)

# the master webpage
url_master = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
# name and store the downloaded contents for testing purpose.
file_folder = "R:"
file_mainname = "test"

# parse the master webpage
items_urls = getItemsFromPage(url_master)

# create pool
mypool = urllib3.PoolManager()

for i, url in enumerate(items_urls):
    # file name to be saved
    file_name = file_folder + "\\" + file_mainname + str(i) + ".htm"
    # replace '&amp;' with '&' so the server sees the real query string
    url_OK = re.sub(r'&amp;', r'&', url)
    # print revised url
    print(url_OK)
    ### the urllib3-pythonic way of web page retrieval ###
    with mypool.request('GET', url_OK) as page, open(file_name, 'w') as f:
        f.write(page.data.decode('utf-8'))

(Verified on Python 3.4, Eclipse PyDev, Win7 x64)
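
As a quick sanity check (a hypothetical snippet, not part of the verified code above), printing one extracted link before the substitution confirms that it still carries the raw &amp; entity from the HTML source:

items_urls = getItemsFromPage(url_master)
# the raw HTML escapes ampersands, so the extracted link still contains &amp;
print(items_urls[0])
# e.g. http://www.charitynavigator.org/index.cfm?bay=search.summary&amp;orgid=...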

In fact, the ampersand in the raw HTML data is the odd part here. When you visit the web page and click a link, the browser reads the escaped ampersand (&amp;) as & and everything works. Python, however, reads the data as-is, that is, as raw data. So:

import urllib.request as net
import html
import re

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0",
}

def unescape(items):
    # html.unescape() turns entities such as &amp; back into & (available since Python 3.4)
    unescaped = []
    for i in items:
        unescaped.append(html.unescape(i))

    return unescaped


def getItemsFromPage(url):
    request = net.Request(url, headers=headers)
    response = net.urlopen(request).read().decode('utf-8')
    # --------------------------
    # FIX AMPERSANDS - unescape
    # --------------------------
    links = re.findall(r'(http://www.charitynavigator.org/index.cfm\?bay=search\.summary&amp;orgid=[\d]+)', response)
    unescaped_links = unescape(links)

    return unescaped_links


url = 'http://www.charitynavigator.org/index.cfm?bay=search.alpha&ltr=1'
item_urls = getItemsFromPage(url)
request = net.Request(item_urls[0], headers=headers)
print(item_urls)
response = net.urlopen(request)

# DEBUG RESPONSE 
print(response.url)
print(80 * '-')

print("<title>Charity Navigator Rating - 10,000 Degrees</title>" in (response.read().decode('utf-8')))