Multiprocessing with text scraping

I want to scrape the <p> elements from pages. There will be several thousand pages, so I want to use multiprocessing. However, when I try to append the results to some variable, it doesn't work.

I want to append the scraped results to data = []

I made a url_common for the base website, because some pages don't start with HTTP etc.

from tqdm import tqdm

import faster_than_requests as requests   #20% faster on average in my case than urllib.request
import bs4 as bs

def scrape(link, data):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.common_url.com/'
        else:
            url_common = ''
        try:
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        for p in paragraphs:
            data.append(p.text)

The above doesn't work, because map() doesn't accept a function like that.

I tried using it another way:

def scrape(link):
    for i in tqdm(link):
        if i[:3] != 'htt':
            url_common = 'https://www.common_url.com/'
        else:
            url_common = ''
        try:
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        parsed = bs.BeautifulSoup(ht, 'lxml')
        paragraphs = parsed.find_all('p')
        for p in paragraphs:
            print(p.text)

from multiprocessing import Pool
p = Pool(10)

links = ['link', 'other_link', 'another_link']
data = p.map(scrape, links) 

I get this error when using the function above:

  Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 297, in _bootstrap
    self.run()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 110, in worker
    task = get()
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 354, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'scrape' on <module '__main__' (built-in)>

I haven't figured out a way to use Pool and, at the same time, append the scraped results to a given variable.
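From what I've read, the AttributeError means the worker processes can't find scrape when they unpickle the task, which happens when the mapped function isn't importable from the top level of a module (e.g. when it was defined in an interactive session). A minimal sketch of the pattern that should work (square is just a placeholder, nothing to do with scraping):

from multiprocessing import Pool

def square(x):  # must live at module top level so workers can import it
    return x * x

if __name__ == '__main__':  # required on Windows, where Pool spawns fresh interpreters
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3]))  # [1, 4, 9]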

EDIT

I changed it a bit to see where it stops:

def scrape(link):
    for i in tqdm(link):
        if i[:3] !='htt':
            url_common = 'https://www.investing.com/'
        else:
            url_common = ''
        try: #tries are always helpful with url as you never know
            ht = requests.get2str(url_common + str(i))
        except:
            pass
        print('works1')
        parsed = bs.BeautifulSoup(ht,'lxml')
        paragraphs = parsed.find_all('p')
        print('works2')
        for p in paragraphs:
            print(p.text)

links = ['link', 'other_link', 'another_link']
scrape(links) 
#WORKS PROPERLY AND PRINTS EVERYTHING 

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(scrape, links))
#DOESN'T WORK, NOTHING PRINTS. Error like above

You are using the map function incorrectly.

It iterates over every element of the iterable and calls the function on each one.

You can see that the map function does something like the following:

to_be_mapped = [1, 2, 3]
mapped = []

def mapping(x): # <-- note that the mapping accepts a single value
    return x**2

for item in to_be_mapped:
    res = mapping(item)
    mapped.append(res)
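The built-in map does that loop in one call, and Pool.map is the same idea with the calls spread across worker processes:

to_be_mapped = [1, 2, 3]

def mapping(x):
    return x ** 2

# map applies mapping to each element; list() materialises the result
mapped = list(map(mapping, to_be_mapped))
print(mapped)  # [1, 4, 9]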

So, to fix your problem, remove the outermost for loop, since that iteration is handled by the map function:

def scrape(link):
    if link[:3] != 'htt':
        url_common = 'https://www.common_url.com/'
    else:
        url_common = ''
    try:
        ht = requests.get2str(url_common + str(link))
    except:
        return  # bail out here; otherwise ht would be undefined below when the request fails
    parsed = bs.BeautifulSoup(ht, 'lxml')
    paragraphs = parsed.find_all('p')
    for p in paragraphs:
        print(p.text)
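To get the results back into data instead of printing, have scrape return its paragraph texts and let Pool.map collect them; it returns one list per link, which you then flatten. A sketch with a stand-in function (scrape_stub is hypothetical and does no network access):

from multiprocessing import Pool

def scrape_stub(link):
    # stand-in for the real scrape(): pretend each page has two paragraphs
    return [f'{link} paragraph 1', f'{link} paragraph 2']

def gather(links):
    # Pool.map returns one list per link, in input order
    with Pool(2) as pool:
        per_link = pool.map(scrape_stub, links)
    # flatten into a single data list, like the original data.append loop
    return [text for page in per_link for text in page]

if __name__ == '__main__':
    data = gather(['link', 'other_link'])
    print(len(data))  # 4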