带有组(group)的 Celery 任务在返回组结果时出现 JSON 序列化错误

Celery tasks with groups which returns Json error when returning group results

我的工作流程有点复杂,但我希望有人能从下面的解释或代码中理解它。

基本上,我正在收集 site/directory 的公司。当查询通过时,它 returns 公司的迷你资料,即每页 50 家公司。使用 Celery,我试图使用一组任务从搜索结果的总页数中获取所有公司。工作流程如下:

  1. 获取所有 10 个页面的所有公司(每页 50 个公司) group(process_ali.s(url, query) for url in urls )()。在这种情况下 urls == 10 和 url 将有 50 家公司
  2. 这意味着我有一个外部列表,其中包含每个页面的列表 结果。每个结果都是一个字典
  3. group(company_worker.s(i) for i in res)() 第1步的结果为 作为一个组处理
  4. 注意 i 是包含每个页面结果的列表 company_work 进程 通过调用另一个组也将此列表作为一个组。

    @shared_task
    def company_worker(items):
        return group(get_site.s(item) for item in items)()
    
    @shared_task
    def process_ali(url, query):
        content = get_documment.s(url)()
        doc = abs_url(content, url)
        if doc is None:
            return
        companies = []
        for sel in doc.xpath("//div[@class='item-main']"):
            item = {'source': 'ebay'}
    
            company = sel.xpath("div[@class='top']/div[@class='corp']/div[@class='item-title']/"
                           "h2[@class='title ellipsis']/a/text()")[0]
            contact_url = sel.xpath("div[@class='top']/div[@class='corp']/"
                                    "div[@class='company']/a[@class='cd']/@href")[0]
    
            item['contact_url'] = contact_url
    
            companies.append(item)
    
        return companies
    
    @shared_task
    def get_site(item):
        site = item.get('contact_url')
        content = get_documment.s(site)() # this module handle requests and returns content of page as string
        doc = abs_url(content, site)  # make links in page absolute and return parseable element tree (lxml.html)
        web_urls = doc.xpath("//div[@class='company-contact-information']/table/tr/td/a/@href")  # more than one website possible
    
        #validate each website
        webs = []
        for url in web_urls:
            uri = None
            if len(url) > 6 and url.startswith('http'):
                uri = url
            elif url.startswith('www'):
                uri = 'http://' + url
    
            if uri:
                up = urlparse(uri)
                site = up.scheme + "://" + up.netloc
                webs.append(site)
         # remove bad links e.g site.ebay.com, faceboo.com
        item['website'] = list(filter(None, remove_blacklist_links(webs)))
        if item['website']:
            #store website and item['company'] in DB
             ...
    
            return item['website']
    
    
    
    def process_site(engine, query):
        # do some stuff here with engine and also find total pages base_url... using requests and lxml
        urls =[]
        for x in range(1, total_pages+1):
            start_url = page + "{}.html".format(x)
            print(start_url)
            urls.append(start_url)
    
       # process bunch of urls that returns list containing list of dictionaries i.e 
       # [[{'url':'http://example.org'},{'url':'http://example.com'}], [{...},{...}]]
        res = group(process_ali.s(url, query) for url in urls )() 
    
        all = group(company_worker.s(i) for i in res)() # process outer list above as a group
        return all
    

这就是我从 Python 解释器调用任务的方式。

>>> from b2b.tasks import *
>>> from pprint import pprint
>>> from celery import shared_task, group, task, chain, chord
>>> from celery.task.sets import subtask
>>> base_url = "http://ebay.com"
>>> query = "bag"
>>> res = process_site.s(base_url, query)()

http://www.ebay.com/company/bag/-50/1.html
http://www.ebay.com/company/bag/-50/2.html
http://www.ebay.com/company/bag/-50/3.html
http://www.ebay.com/company/bag/-50/4.html
http://www.ebay.com/company/bag/-50/5.html
...

我在上面的 url 列表后立即得到回溯...

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/canvas.py", line 172, in __call__
    return self.type(*args, **kwargs)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/Users/Me/projects/django_stuff/scraper/b2b/tasks.py", line 224, in process_site
    all = group(company_worker.s(i) for i in res)()
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/canvas.py", line 525, in __call__
    return self.apply_async(partial_args, **options)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/canvas.py", line 504, in apply_async
    add_to_parent=add_to_parent)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/app/builtins.py", line 172, in run
    add_to_parent=False) for stask in taskit]
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/canvas.py", line 251, in apply_async
    return _apply(args, kwargs, **options)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/app/task.py", line 559, in apply_async
    **dict(self._get_exec_options(), **options)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/app/base.py", line 353, in send_task
    reply_to=reply_to or self.oid, **options
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/celery/app/amqp.py", line 305, in publish_task
    **kwargs
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/kombu/messaging.py", line 165, in publish
    compression, headers)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/kombu/messaging.py", line 241, in _prepare
    body) = dumps(body, serializer=serializer)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/kombu/serialization.py", line 164, in dumps
    payload = encoder(data)
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/kombu/serialization.py", line 59, in _reraise_errors
    reraise(wrapper, wrapper(exc), sys.exc_info()[2])
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/kombu/serialization.py", line 55, in _reraise_errors
    yield
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/kombu/serialization.py", line 164, in dumps
    payload = encoder(data)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/anyjson/__init__.py", line 141, in dumps
    return implementation.dumps(value)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/anyjson/__init__.py", line 87, in dumps
    return self._encode(data)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/simplejson/__init__.py", line 380, in dumps
    return _default_encoder.encode(obj)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/simplejson/encoder.py", line 275, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/simplejson/encoder.py", line 357, in iterencode
    return _iterencode(o, 0)
  File "/Users/Me/.virtualenvs/djangoscrape/lib/python2.7/site-packages/simplejson/encoder.py", line 252, in default
    raise TypeError(repr(o) + " is not JSON serializable")
EncodeError: <AsyncResult: 7838a203-a853-4755-992b-cfd67207d398> is not JSON serializable
>>> 

发送到 Celery 任务的参数必须是 JSON 可序列化的(例如字符串、列表、字典等)。具体到这里:`res` 是一个 `GroupResult`,直接迭代它得到的是 `AsyncResult` 对象;把它们作为参数传给 `company_worker.s()` 时无法被 JSON 序列化,这正是回溯末尾 `EncodeError: <AsyncResult: ...> is not JSON serializable` 的来源。先调用 `res.get()` 取出实际的结果列表,再对其分组即可解决。