scrapy: request url must be str or unicode, got Selector
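I am writing a spider with Scrapy to scrape user details from Pinterest. I am trying to get the details of a user and of their followers (and so on, down to the last node).
Below is the spider code: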
from scrapy.spider import BaseSpider
import scrapy
from pinners.items import PinterestItem
from scrapy.http import FormRequest
from urlparse import urlparse

class Sample(BaseSpider):
    name = 'sample'
    allowed_domains = ['pinterest.com']
    start_urls = ['https://www.pinterest.com/banka/followers', ]

    def parse(self, response):
        for base_url in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
            list_a = response.urljoin(base_url.extract())
            for new_urls in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
                yield scrapy.Request(new_urls, callback=self.Next)
            yield scrapy.Request(list_a, callback=self.Next)

    def Next(self, response):
        href_base = response.xpath('//div[@class = "tabs"]/ul/li/a')
        href_board = href_base.xpath('//div[@class="BoardCount Module"]')
        href_pin = href_base.xpath('.//div[@class="Module PinCount"]')
        href_like = href_base.xpath('.//div[@class="LikeCount Module"]')
        href_followers = href_base.xpath('.//div[@class="FollowerCount Module"]')
        href_following = href_base.xpath('.//div[@class="FollowingCount Module"]')
        item = PinterestItem()
        item["Board_Count"] = href_board.xpath('.//span[@class="value"]/text()').extract()[0]
        item["Pin_Count"] = href_pin.xpath('.//span[@class="value"]/text()').extract()
        item["Like_Count"] = href_like.xpath('.//span[@class="value"]/text()').extract()
        item["Followers_Count"] = href_followers.xpath('.//span[@class="value"]/text()').extract()
        item["Following_Count"] = href_following.xpath('.//span[@class="value"]/text()').extract()
        item["User_ID"] = response.xpath('//link[@rel="canonical"]/@href').extract()[0]
        yield item
I get the following error:
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got Selector:
I did check the type of list_a (the extracted URL). It gives me unicode.
The error is raised by the inner for loop in the parse method:
for new_urls in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
    yield scrapy.Request(new_urls, callback=self.Next)
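The new_urls variable is actually a Selector, not a string. Extract the URL before passing it to the request, and drop the inner loop, like this: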
for base_url in response.xpath('//div[@class="Module User gridItem"]/a/@href'):
    list_a = response.urljoin(base_url.extract())
    yield scrapy.Request(list_a, callback=self.Next)
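
To see the string handling on its own, here is a minimal standard-library sketch of the step that happens after .extract() has turned each Selector into a plain string. It assumes that response.urljoin resolves an href against the page URL in roughly the way urllib.parse.urljoin does. The helper name absolute_urls and the sample hrefs are made up for illustration and are not part of Scrapy or of the original spider. It is written for Python 3, whereas the question's code targets Python 2.

```python
from urllib.parse import urljoin


def absolute_urls(page_url, hrefs):
    """Resolve extracted href strings against the page URL, skipping repeats.

    Hypothetical helper: it only illustrates why a request needs a plain
    string rather than a Selector object.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not isinstance(href, str):
            # Same kind of failure as in the question: a non-string value
            # (such as a Selector) cannot be used as a URL.
            raise TypeError("url must be str, got %s" % type(href).__name__)
        url = urljoin(page_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    page = "https://www.pinterest.com/banka/followers"
    sample_hrefs = ["/alice/", "/bob/", "/alice/"]  # made-up values
    for url in absolute_urls(page, sample_hrefs):
        print(url)
```

Skipping repeats here is only a convenience for the sketch; by default Scrapy filters duplicate requests on its own.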