pool.map list index out of range python

About 70% of the time it shows this error:

    res=pool.map(feng,urls)
  File "c:\Python27\lib\multiprocessing\pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "c:\Python27\lib\multiprocessing\pool.py", line 567, in get
    raise self._value
IndexError: list index out of range

I don't know why, but if there are fewer than 100 items there is only about a 5% chance the message shows up. Does anyone know how to improve this?

#coding:utf-8
import multiprocessing
import requests
import bs4
import re

root_url = 'http://www.haoshiwen.org'

def xianqin_url():
    l = []
    for i in range(1, 64):  # number of index pages
        index_url = root_url + '/type.php?c=1' + '&page=' + "%s" % i
        response = requests.get(index_url)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        # collect every link inside a div.sons block on this page
        x = [a.attrs.get('href') for a in soup.select('div.sons a[href^=/]')]
        for href in x:
            url = root_url + href
            print "collected %s" % url
            l.append(url)
    return l

def start_process():
    print 'Starting', multiprocessing.current_process().name

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    # content = soup.select('div.shileft')  # earlier bs4-based attempt, unused
    qq = str(soup)
    # grab the field that starts with "原文" (original text) and ends at the closing </div>
    soupout = re.findall(r"原文(.+?)</div>", qq, re.S)
    content = str(soupout[1])
    # count occurrences of 风 (wind), 花 (flower), 雪 (snow), 月 (moon)
    f = content.count("风")
    h = content.count("花")
    x = content.count("雪")
    y = content.count("月")
    return f, h, x, y


def find(urls):
    r = [0, 0, 0, 0]
    pool = multiprocessing.Pool()
    res = pool.map(feng, urls)
    for counts in res:
        r = map(lambda (a, b): a + b, zip(r, counts))
    return r


if __name__ == "__main__":
    print "collecting URLs"
    qurls = xianqin_url()
    print "collected %s links" % len(qurls)
    print "matching pre-Qin poems"
    f, h, x, y = find(qurls)  # run the pool once and reuse the result
    print '''
    across %s pre-Qin texts:
---------------------------
    风 (wind): %s
    花 (flower): %s
    雪 (snow): %s
    月 (moon): %s
    source: %s
    ''' % (len(qurls), f, h, x, y, root_url)

I am trying to extract some substrings from that website with multiprocessing.

Indeed, multiprocessing makes debugging a little harder, because you cannot see where the index out of range error actually occurred (the error message makes it look as if it happened inside the multiprocessing module).
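
One way to surface the worker-side traceback (a debugging aid of my own, not part of the original answer; feng_debug is a hypothetical wrapper) is to catch and print the exception inside the worker before re-raising it:

import traceback

def feng_debug(url):
    # hypothetical wrapper around feng(): print the real traceback from
    # inside the worker process, then re-raise so pool.map still sees it
    try:
        return feng(url)
    except Exception:
        print "error while processing %s" % url
        traceback.print_exc()
        raise

Mapping feng_debug instead of feng shows which URL and which line actually failed.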

In some cases this line:

content = str(soupout[1])

raises an index out of range error, because soupout does not have a second element (re.findall matched too little). If you change it to

if len(soupout) < 2:
    return None

and then drop the returned None values by changing

res = pool.map(feng, urls)

into

res = pool.map(feng, urls)
res = [r for r in res if r is not None]

then you can avoid the error. That said, you probably want to find the root cause of why re.findall returns an empty list. Selecting the node with BeautifulSoup is definitely better than using a regular expression, because matching with bs4 is usually more stable, especially if the website slightly changes its markup (whitespace and so on).
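
As a sketch of that bs4-based approach: assuming the poem body lives in a div.shileft node, as the commented-out line in the question suggests (this selector is an assumption and needs to be verified against the site's actual markup), the worker could select the node directly instead of regex-matching the serialized soup:

import bs4
import requests

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    nodes = soup.select('div.shileft')  # assumed container of the poem text
    if not nodes:                       # page did not match: skip it
        return None
    content = nodes[0].get_text()       # unicode text of the node
    return (content.count(u"风"), content.count(u"花"),
            content.count(u"雪"), content.count(u"月"))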

Update:

Why is soupout an empty list? I never got this error message when I was not using pool.map.

That is probably because you are hitting the web server too fast. In a comment you mentioned that you sometimes get 504 in response.status_code. 504 means Gateway Time-out: the server was acting as a gateway or proxy and did not receive a timely response from the upstream server.

This is because haoshiwen.org appears to be powered by kangle, which is a reverse proxy. The reverse proxy hands every request you send it on to the web server behind it, and if you start too many processes at once, the poor web server cannot handle the flood. Kangle has a default timeout of 60 s, so as soon as it gets no answer from the web server within 60 seconds, it shows the error you posted.
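
If the 504s are only occasional, one workaround (my addition, not part of the original answer; get_with_retry is a hypothetical helper) is to retry the request a few times before giving up:

import time
import requests

def get_with_retry(url, attempts=3, pause=5):
    # hypothetical helper: retry on gateway timeouts (504), pausing
    # between attempts to give the upstream server time to recover
    response = requests.get(url)
    for _ in range(attempts - 1):
        if response.status_code != 504:
            break
        time.sleep(pause)
        response = requests.get(url)
    return response  # the caller should still check status_code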

How can you solve this?

  • You could limit the number of processes: pool = multiprocessing.Pool(2). You would need to experiment to find a pool size the site can handle.
  • You could add a time.sleep(5) at the top of feng(url) so each process waits 5 seconds between its requests. Here, too, you would need to tune the sleep time. Both ideas are combined in the sketch after this list.
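
A minimal sketch combining both suggestions, reusing the guarded feng from above (the pool size of 2 and the 5-second sleep are starting points to tune, not recommended values; feng_throttled is a hypothetical wrapper):

import time
import multiprocessing

def feng_throttled(url):
    # hypothetical wrapper: pause before each request so the web server
    # behind the reverse proxy is not flooded
    time.sleep(5)  # tune this interval
    return feng(url)

def find(urls):
    r = [0, 0, 0, 0]
    pool = multiprocessing.Pool(2)  # cap the number of worker processes; tune this
    res = pool.map(feng_throttled, urls)
    res = [counts for counts in res if counts is not None]  # drop skipped pages
    for counts in res:
        r = map(lambda (a, b): a + b, zip(r, counts))
    return r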