Scrapy script returns elements in the shell but not when I run the spider

Here is my code:

import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'
    
    start_urls = ['http://www.cleanman-cn.com/productlist.php/']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link = cat.css('a::attr(href)').get()

                categories = {
                    'Categorie': name,
                    'Url': base_url + link
                }
                yield categories

                csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
                urls = csv['Url']

                for url in urls:
                    yield scrapy.Request(url, callback=self.parse)
                    master = response.css('.web_prolist')
                    for item in master:
                        li = item.css('li')
                        for x in li:
                            link = x.css('a::attr(href)').get()
                            yield link

When I grab my elements with scrapy shell, I get the result shown below:

In [13]: master = response.css('.web_prolist')

In [18]: for item in master:
    ...:     li = item.css('li')
    ...:     for x in li:
    ...:         link = x.css('a::attr(href)').get()
    ...:         print(link)
    ...: 
product_show.php?id=789
product_show.php?id=790
product_show.php?id=707
product_show.php?id=708
product_show.php?id=709
product_show.php?id=710
product_show.php?id=711
product_show.php?id=712
product_show.php?id=713

When I run my spider, I get this result:

2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matching Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=1'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Two Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=2'}
2021-11-03 17:28:17 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.cleanman-cn.com/product.php?b_id=1> 
- no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'One Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=3'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=4'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=5'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Color Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=6'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matt Finish Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=7'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Intelligent Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=8'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=9'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Pedestal basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=10'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Accessory', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=11'}
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=10> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=3> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=5> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=11> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=2> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=4> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=6> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=1> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=9> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=7> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=8> (referer: http://www.cleanman-cn.com/productlist.php/)

I am yielding all the category links from the first page, then yielding a scrapy Request for each of those URLs so that I can parse each response and grab the product links, which are the pages where I will scrape the detail information.

But I can't get it to work, even though everything looks right to me and the shell gives the correct output.

What am I doing wrong?

I'm a self-taught Python "developer" doing this just for fun, so I'm sure I'm doing something wrong that can be fixed. Please be kind with any criticism of my code or the way I code; this is all part of my learning process.

Thanks in advance.

First, you need to remove the trailing / at the end of ['http://www.cleanman-cn.com/productlist.php/'] (test with and without the slash to see the difference).

You were trying to yield a string: ERROR: Spider must return request, item, or None, got 'str' (the link).
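
A callback can only yield Request objects, items (a plain dict counts), or None, so the smallest change to your inner loop is to wrap the href in a dict. A minimal sketch of that change:

for x in li:
    link = x.css('a::attr(href)').get()
    # yield link           # a bare str is rejected by Scrapy
    yield {'link': link}   # a dict is accepted as an item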

You probably also want to scrape the links in a separate callback:

import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'

    # here I removed the slash at the end
    start_urls = ['http://www.cleanman-cn.com/productlist.php']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link  = cat.css('a::attr(href)').get()

                categories = {
                    'Categorie' : name,
                    'Url' : base_url + link
                }
                yield categories

                csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
                urls = csv['Url']

                for url in urls:
                    # since I don't have your 'cleanmancategories' I tested it with url=base_url + link
                    yield scrapy.Request(url=url, callback=self.parse_items)


    def parse_items(self, response):
        master = response.css('.web_prolist')
        for item in master:
            li = item.css('li')
            for x in li:
                link = x.css('a::attr(href)').get()
                yield {'link': link}

Output:

{'link': 'product_show.php?id=773'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=774'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=775'}
...
...
...
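
Note that the product links you get back are relative ('product_show.php?id=...'). If you want to go one level deeper and scrape the detail pages, you can let Scrapy resolve them with response.follow. A rough sketch, assuming a parse_product callback that you would still have to write for the fields you actually need:

    def parse_items(self, response):
        for link in response.css('.web_prolist li a::attr(href)').getall():
            # response.follow resolves the relative href against response.url
            # and returns a scrapy.Request for the product_show.php page
            yield response.follow(link, callback=self.parse_product)

    def parse_product(self, response):
        # hypothetical detail parser - replace the selectors with the fields you need
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }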