From local scrapy to scrapy cloud (scraping hub) - Unexpected results

Question

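My spider deployed on Scrapy Cloud produces unexpected results compared with the local version. Locally, I can easily extract every field of a product item (from an online retailer), but on Scrapy Cloud the fields "ingredients" and "list of prices" always come back empty. In the attached picture you will see the two elements that always give me empty results, whereas everything works perfectly fine locally.

I am using Python 3, and the stack is configured with scrapy:1.3-py3. At first I thought it was a problem with regex and unicode, but it seems not. So I tried everything: u, ur, RE.ENCODE .... but nothing worked.

For the ingredients part, my code is as follows: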

    data_box=response.xpath('//*[@id="ingredients"]').css('div.information__tab__content *::text').extract()
    data_inter=''.join(data_box).strip()

    match1=re.search(r'([Ii]ngr[ée]dients\s*\:{0,1})\s*(.*)\.*',data_inter)
    match2=re.search(r'([Cc]omposition\s*\:{0,1})\s*(.*)\.*',data_inter)


    if match1:
        result_matching_ingredients=match1.group(1,2)[1].replace('"','').replace(".","").replace(";",",").strip()

    elif match2 : 
        result_matching_ingredients=match2.group(1,2)[1].replace('"','').replace(".","").replace(";",",").strip()

    else:
        result_matching_ingredients=''

    ingredients=result_matching_ingredients

It seems that there is never a match on Scrapy Cloud.

For the prices, my code is as follows:

    list_prices=[]

    for package in list_packaging : 
        tonnage=package.css('div.product__varianttitle::text').extract_first().strip()
        prix_inter=(''.join(package.css('span.product__smallprice__text').re(r'\(\s*\d+\,\d*\s*€\s*\/\s*kg\)')))
        prix=prix_inter.replace("(","").replace(")","").replace("/","").replace("€","").replace("kg","").replace(",",".").strip()

        list_prices.append(prix)

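Same story here. Still empty.

I repeat: it works fine on my local version. These two fields are the only ones causing a problem: I am extracting a bunch of other data with Scrapy Cloud (also using regex) and I am very happy with it.

Any ideas?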

Answer 1

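I work with ScrapingHub a lot, and the way I usually debug is:

Check the job requests (through the ScrapingHub interface)

This is to check that there is no redirect making the page slightly different, such as an added query string like ?lang=en

Check the job log (through the ScrapingHub interface)

You can print, or use a logger, to inspect whatever you want from within the parser. So if you really want to make sure the spider sees the same thing on your local machine and on ScrapingHub, you can print(response.body) in both places and compare the output to find what may be causing the difference.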
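
Comparing two full page bodies by eye is tedious, so here is a rough sketch of the comparison described above that reduces each response to one log line. The name summarize_response is hypothetical and not part of Scrapy. It takes plain values so it can be tried without Scrapy installed, and it uses str.format rather than f-strings so it also runs on older Python 3 stacks. The commented usage assumes Scrapy's redirect middleware has stored the original URL under the redirect_urls meta key.

```python
import hashlib
from urllib.parse import parse_qsl, urlsplit


def summarize_response(requested_url, final_url, status, body):
    """Build a one-line summary that can be compared between two runs.

    Hypothetical helper, not part of Scrapy. `body` must be bytes,
    like `response.body`.
    """
    requested_query = dict(parse_qsl(urlsplit(requested_url).query))
    final_query = dict(parse_qsl(urlsplit(final_url).query))
    # Parameters that are new or changed in the final URL hint at a
    # redirect, for example an added ?lang=en.
    added = {k: v for k, v in final_query.items() if requested_query.get(k) != v}
    # A short digest is enough to see whether two bodies differ.
    digest = hashlib.sha256(body).hexdigest()[:16]
    return (
        "status={status} redirected={redirected} added_params={added} "
        "length={length} sha256={digest}"
    ).format(
        status=status,
        redirected=requested_url != final_url,
        added=added,
        length=len(body),
        digest=digest,
    )


# Possible use inside a spider callback (assumption: a redirect, if any,
# was recorded by Scrapy under response.meta["redirect_urls"]):
#     first_url = response.meta.get("redirect_urls", [response.url])[0]
#     self.logger.info(summarize_response(
#         first_url, response.url, response.status, response.body))
```

If the line logged locally and the line in the ScrapingHub job log differ in length or digest, the two environments are not receiving the same page, and the regexes are probably not the culprit.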

If you can't find it, I'll try to deploy a small spider on ScrapingHub and edit this post if I have some time left today!

Answer 2

Check that the ScrapingHub log shows the expected Python version, even if the stack is set correctly in the project's yml file.
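
One way to make that check explicit, as a minimal sketch: have the spider log which interpreter is actually running the job, then look for that line in the job log. The name runtime_summary is hypothetical. It uses only the standard library and avoids f-strings, so it should still run if the job unexpectedly ends up on an older interpreter.

```python
import platform
import sys


def runtime_summary():
    """Describe the interpreter that is actually running the job.

    Hypothetical helper using only the standard library.
    """
    major, minor, micro = sys.version_info[:3]
    return "python={0}.{1}.{2} implementation={3} default_encoding={4}".format(
        major,
        minor,
        micro,
        platform.python_implementation(),
        sys.getdefaultencoding(),
    )


# Possible use inside a spider callback (Scrapy spiders expose self.logger):
#     self.logger.info(runtime_summary())
```

If the logged version does not match the one implied by the stack name (scrapy:1.3-py3 in the question), the stack setting is likely not being picked up at deploy time.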

regex

scrapy

python-3.x

scrapinghub