How can I strip a string with different ends in Python for a Scrapy bot?
I'm building a Scrapy spider and need an efficient, correct way to strip a string that contains a URL. The URL always starts with [u' and ends with '],
for example [u'http://example.com/2334878']
def parse(self, response):
    for sel in response.xpath("//div[@class='category']/a"):
        item = SpiderItem()
        item['title'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        linkToPost = str(item['link'])
        linkToPost = linkToPost.strip("['u")
        linkToPost = linkToPost.replace("'", "")
        linkToPost = linkToPost.replace("]", "")
        print linkToPost
        # Parse request to follow the posting link into the actual post
        request = scrapy.Request(linkToPost, callback=self.parse_item_page)
        request.meta['item'] = item
        yield request
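This happens because extract() returns a list, so calling str() on the result gives you the list's printed form, brackets and quotes included. From the Scrapy documentation:
    extract()
        Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
The most idiomatic Scrapy approach here is to use an ItemLoader with the TakeFirst or Join processors.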
Alternatively, just take the first element of the list:
item['title'] = sel.xpath('text()').extract()[0]
item['link'] = sel.xpath('@href').extract()[0]
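To see why the stripping was needed in the first place, and why indexing avoids it, here is a small standalone sketch with no Scrapy dependency. The list literal stands in for what extract() returns, and first_or_default is a hypothetical helper (not part of Scrapy) that guards against an empty result, since [0] raises IndexError when the XPath matches nothing.

```python
def first_or_default(values, default=None):
    """Return the first element of values, or default if it is empty."""
    return values[0] if values else default


# Stand-in for sel.xpath('@href').extract(), which returns a list.
links = ["http://example.com/2334878"]

# str() of a list is its repr, brackets and quotes included
# (Python 2 additionally prints a u prefix for unicode strings).
as_text = str(links)

# Indexing yields the plain string, ready to pass to scrapy.Request.
link = first_or_default(links)

# No match: None instead of an IndexError.
missing = first_or_default([])
```

Recent Scrapy versions also provide extract_first() (also available as get()) on selector lists, which returns None when nothing matches, so a helper like this may not be needed there.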