How can I strip a string with different ends in Python for a Scrapy bot?

Question

I'm building a Scrapy spider and need an efficient, correct way to strip a string that contains a URL. The string always starts with [u' and ends with '], for example [u'http://example.com/2334878'].

def parse(self, response):
    for sel in response.xpath("//div[@class='category']/a"):
        item = SpiderItem()
        item['title'] = sel.xpath('text()').extract()
        item['link'] = sel.xpath('@href').extract()
        linkToPost = str(item['link'])
        linkToPost = linkToPost.strip("['u")
        linkToPost = linkToPost.replace("'", "")
        linkToPost = linkToPost.replace("]", "")
        print linkToPost
        #Parse request to follow the posting link into the actual post
        request = scrapy.Request(linkToPost , callback=self.parse_item_page)
        request.meta['item'] = item
        yield request

Answer 1

This happens because extract() returns a list:

extract()

Serialize and return the matched nodes as a list of unicode strings. Percent encoded content is unquoted.
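To see where the brackets and quotes come from, here is a small sketch that uses a plain list as a stand-in for the value extract() returns (the list literal is an assumption for illustration, not real spider output). Calling str() on a list produces the list's repr; under Python 3 that repr has no u prefix, while Python 2, which the question uses, prints [u'...'].

```python
# Stand-in for what sel.xpath('@href').extract() returns: a list of strings.
links = ['http://example.com/2334878']

# str() on a list yields the list's repr, brackets and quotes included.
as_text = str(links)

# Indexing the list yields the URL itself, so there is nothing to strip.
first_link = links[0]

print(as_text)
print(first_link)
```

Note also that str.strip("['u") treats its argument as a set of characters to remove from both ends, not as a literal prefix, which is why the string-cleaning approach is fragile.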

The most "Scrapy-ish" approach here is to use an ItemLoader with the TakeFirst or Join processors.
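As a rough illustration of what those two processors do to an extracted list, the following plain-Python sketch mimics their behavior. The helpers take_first and join_values are hypothetical names written for this example; the real processors are classes shipped with Scrapy (their import path depends on the Scrapy version), and they are normally attached to an ItemLoader rather than called by hand.

```python
def take_first(values):
    """Mimic TakeFirst: return the first value that is not None or ''."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


def join_values(values, separator=' '):
    """Mimic Join: concatenate the values with a separator."""
    return separator.join(values)


# Stand-ins for extracted data (assumed values, for illustration only).
hrefs = ['http://example.com/2334878']
title_parts = ['Some', 'title']

link = take_first(hrefs)          # a single string instead of a list
title = join_values(title_parts)  # one string built from all parts

print(link)
print(title)
```

With either processor in place, the item field holds a plain string, so it can be passed to scrapy.Request directly.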

Alternatively, just take the first element of the list:

item['title'] = sel.xpath('text()').extract()[0]
item['link'] = sel.xpath('@href').extract()[0]
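One caveat with [0] is that it raises IndexError when the XPath matches nothing. A minimal guard could look like the sketch below; first_or_default is a hypothetical helper, and the lists are stand-ins for extract() results. Newer Scrapy versions also provide extract_first() on selector lists, which returns None instead of raising when there is no match.

```python
def first_or_default(values, default=None):
    """Return the first element of a list, or a default if it is empty."""
    return values[0] if values else default


# Stand-ins for extract() results: one match, and no match at all.
matched = ['http://example.com/2334878']
unmatched = []

link = first_or_default(matched)
missing = first_or_default(unmatched)

print(link)
print(missing)
```

In the spider, a missing link could then be skipped with a simple check before building the request.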

python

string

scrapy

web-scraping

scrapy-spider