Scrapy script returns elements in the shell but not when I run the spider
Here is my code:
import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'
    start_urls = ['http://www.cleanman-cn.com/productlist.php/']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link = cat.css('a::attr(href)').get()
                categories = {
                    'Categorie': name,
                    'Url': base_url + link
                }
                yield categories

        csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
        urls = csv['Url']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)

        master = response.css('.web_prolist')
        for item in master:
            li = item.css('li')
            for x in li:
                link = x.css('a::attr(href)').get()
                yield link
When I grab my elements with the scrapy shell, I get this result:
In [13]: master = response.css('.web_prolist')

In [18]: for item in master:
    ...:     li = item.css('li')
    ...:     for x in li:
    ...:         link = x.css('a::attr(href)').get()
    ...:         print(link)
    ...:
product_show.php?id=789
product_show.php?id=790
product_show.php?id=707
product_show.php?id=708
product_show.php?id=709
product_show.php?id=710
product_show.php?id=711
product_show.php?id=712
product_show.php?id=713
When I run my spider I get this result:
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matching Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=1'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Two Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=2'}
2021-11-03 17:28:17 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.cleanman-cn.com/product.php?b_id=1>
- no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'One Piece Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=3'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=4'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=5'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Color Art Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=6'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Matt Finish Series', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=7'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Intelligent Toilet', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=8'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Wall-hung Basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=9'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Pedestal basin', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=10'}
2021-11-03 17:28:17 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/productlist.php/>
{'Categorie': 'Accessory', 'Url': 'http://www.cleanman-cn.com/product.php?b_id=11'}
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=10> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=3> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=5> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=11> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=2> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=4> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=6> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=1> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=9> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=7> (referer: http://www.cleanman-cn.com/productlist.php/)
2021-11-03 17:28:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cleanman-cn.com/product.php?b_id=8> (referer: http://www.cleanman-cn.com/productlist.php/)
I am using yield to get all the links on the first page of each category, then yielding a scrapy.Request for each of those URLs so that the response gives me the product links, from which I will later scrape the detail information.
But I can't get it to work, even though everything looks correct to me and the shell gives the right output.
What am I doing wrong?
I am a self-taught Python "developer" doing this just for fun, and I'm sure I'm doing something wrong that I can fix. Please be kind with any criticism of my code or my coding style; this is all part of my learning process.
Thanks in advance
First, you need to remove the trailing / from ['http://www.cleanman-cn.com/productlist.php/'] (test it with and without the slash to see the difference).
You are trying to yield a plain string, which raises: ERROR: Spider must return request, item, or None, got 'str' (link).
You probably also want to scrape the links in a separate callback:
import scrapy
import pandas as pd

base_url = 'http://www.cleanman-cn.com/'

class CleanmanSpider(scrapy.Spider):
    name = 'clean'
    # here I removed the slash at the end
    start_urls = ['http://www.cleanman-cn.com/productlist.php']

    def parse(self, response):
        for cat in response.css('.wow.fadeInUp'):
            name = cat.css('a > p::text').get()
            if name is not None:
                name = cat.css('a > p::text').get().strip()
                link = cat.css('a::attr(href)').get()
                categories = {
                    'Categorie': name,
                    'Url': base_url + link
                }
                yield categories

        csv = pd.read_csv(r'C:\Users\hermi\WebScraping\Scrapy\cleanman\cleanman\cleanmancategories.csv')
        urls = csv['Url']
        for url in urls:
            # since I don't have your 'cleanmancategories' I tested it with url=base_url + link
            yield scrapy.Request(url=url, callback=self.parse_items)

    def parse_items(self, response):
        master = response.css('.web_prolist')
        for item in master:
            li = item.css('li')
            for x in li:
                link = x.css('a::attr(href)').get()
                yield {'link': link}
Output:
{'link': 'product_show.php?id=773'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=774'}
[scrapy.core.scraper] DEBUG: Scraped from <200 http://www.cleanman-cn.com/product.php?b_id=1>
{'link': 'product_show.php?id=775'}
...
...
...
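One follow-up note: the link values scraped above (e.g. 'product_show.php?id=773') are relative hrefs. If you later want to request those product pages for the detail information, you need absolute URLs. A minimal sketch using the standard library's urljoin, which resolves a relative href against the page it was found on (inside a Scrapy callback you could equivalently use response.urljoin(link) or response.follow(link, callback=...)); the example URLs are taken from the output above:

```python
from urllib.parse import urljoin

# the page the href was scraped from, and the relative href itself
page_url = 'http://www.cleanman-cn.com/product.php?b_id=1'
relative_link = 'product_show.php?id=773'

# resolve the relative href against the page URL, the same way a browser would
absolute = urljoin(page_url, relative_link)
print(absolute)  # http://www.cleanman-cn.com/product_show.php?id=773
```

In a spider you would then yield scrapy.Request(url=absolute, callback=self.parse_details) instead of yielding the bare string.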