Remove/Exclude Non-Breaking Space from Scrapy result

Question

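I am currently trying to crawl a website for article prices, but I have run into a problem (after somehow solving the issue of dynamically generated prices, which was a huge pain).

I can retrieve the prices and article names without trouble, but every second result for 'price' is "\xa0". I tried removing it with 'normalize-space()', but to no avail.

My code: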

import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys

class mySpider(scrapy.Spider):
    name = "placeholder"
    allowed_domains = ["placeholder.com"]
    start_urls = ["https://www.placeholder.com"]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get("https://www.placeholder.com")
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//body'):
            item = HorniItem()
            item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
            item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract()
            yield item

Answer 1

\xa0 is the non-breaking space in Latin-1 (ISO 8859-1), which is U+00A0 in Unicode. Replace it like this:

string = string.replace(u'\xa0', u' ')
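
As for why normalize-space() did not help: in XPath 1.0 it only strips and collapses space, tab, carriage return and line feed, so \xa0 passes through untouched. A rough Python-side equivalent that does treat \xa0 as whitespace could look like the sketch below. The function name normalize_space is made up for illustration, and the sketch assumes Python 3, where \s in a str pattern also matches non-breaking spaces.

```python
import re


def normalize_space(text):
    """Collapse runs of whitespace (including \\xa0) and trim both ends.

    Hypothetical helper, not part of Scrapy. Assumes Python 3, where
    \\s in a str pattern is Unicode-aware.
    """
    return re.sub(r'\s+', ' ', text).strip()


# A price with a non-breaking space before the currency
print(repr(normalize_space(u'12,99\xa0EUR')))
# A value that consists only of a non-breaking space
print(repr(normalize_space(u'\xa0')))
```

A value that was nothing but \xa0 comes back as an empty string, which makes it easy to filter out afterwards.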

Update:

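You can apply it to your code like this:

for post in response.xpath('//body'):
    item = HorniItem()
    item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
    # extract() returns a list of strings, so clean each entry individually
    prices = post.xpath('//p[@class="display-price"]/span/text()').extract()
    prices = [p.replace(u'\xa0', u' ').strip() for p in prices]
    # drop entries that consisted only of a non-breaking space
    item['price'] = [p for p in prices if p]
    if item['price']:
        yield item

Here each extracted string has its non-breaking spaces replaced, entries that end up empty are dropped, and the item is yielded only if at least one price remains.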
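
Because .extract() returns a list, the replacement has to be applied to each string rather than to the list itself. A minimal standalone sketch of that step, runnable without a spider, is shown below. The helper name clean_prices and the sample values are invented for illustration and do not come from the site in the question.

```python
def clean_prices(raw_prices):
    """Replace non-breaking spaces and drop entries that end up empty.

    Hypothetical helper; raw_prices is expected to be a list of strings
    such as the one returned by .extract().
    """
    cleaned = []
    for value in raw_prices:
        text = value.replace(u'\xa0', u' ').strip()
        if text:  # skip entries that were only whitespace
            cleaned.append(text)
    return cleaned


# Made-up sample mimicking "every second result is \xa0"
sample = [u'12,99\xa0EUR', u'\xa0', u'7,50\xa0EUR', u'\xa0']
print(clean_prices(sample))
```

With this sample, only the two real prices remain in the returned list.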

Tags: python, scrapy