Python

Question

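I'm a new user of scrapy.org and new to Python. The brand and title attributes (to use the Java OOP term) hold values that contain tabs, spaces, and newlines. How can I trim them so that these two attributes hold plain strings like the following?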

item['brand'] = "KORAL ACTIVEWEAR"
item['title'] = "Boom Leggings"

Here is the data structure:

{'store_id': 870, 'sale_price_low': [], 'brand': [u'\n                KORAL ACTIVEWEAR\n              '], 'currency': 'AUD', 'retail_price': [u'0.00'], 'category': [u'Activewear'], 'title': [u'\n                Boom Leggings\n              '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u'  https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}

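I was able to trim the price down to just the digits and the decimal point by using a regular expression search, but I think this may give wrong results when the price contains a comma separator.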

price = re.compile('[0-9\.]+')
item['retail_price'] = filter(price.search, item['retail_price'])

Answer 1

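It looks like all you need to do, at least for this example, is strip the whitespace from the edges of the brand and title values. You don't need a regular expression for that; just call the strip method.

However, your brand is not a single string; it is a list of strings (even if the list contains only one string). So if you try to strip it directly, or run a regular expression on it, you will get an AttributeError or a TypeError from trying to treat that list as a string.

To fix this, you need to map strip over all of the strings, using either the map function or a list comprehension: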

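item['brand'] = [brand.strip() for brand in item['brand']]
# The values are unicode objects in Python 2, so use unicode.strip;
# str.strip would raise a TypeError on them.
item['title'] = map(unicode.strip, item['title'])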

…whichever of the two you find easier to understand.
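To make the list-versus-string point concrete, here is a minimal sketch using a trimmed-down copy of the item from the question. It is written for Python 3, whereas the question targets Python 2.7 (where these strings would be unicode objects), and the `item` dict here is sample data only.

```python
# Trimmed-down copy of the scraped item from the question (sample data).
item = {
    'brand': ['\n                KORAL ACTIVEWEAR\n              '],
    'title': ['\n                Boom Leggings\n              '],
}

# Calling strip() on the list itself fails: lists have no strip method.
try:
    item['brand'].strip()
except AttributeError as exc:
    error_name = type(exc).__name__

# Strip each string inside the list instead.
item['brand'] = [brand.strip() for brand in item['brand']]
# In Python 3, map() returns a lazy iterator, so wrap it in list().
item['title'] = list(map(str.strip, item['title']))

# If a plain string is wanted rather than a one-element list, take the
# first element (this assumes the list is non-empty).
brand = item['brand'][0]
title = item['title'][0]
print(brand, '|', title)
```

Taking the first element is one way to get the plain string values the question asks for; whether that is safe depends on the selector always returning at least one match.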

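If you have other examples with runs of whitespace embedded inside the text, and you want to turn each such run into a single space character, you need to use the sub method with a regular expression: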

item['brand'] = [re.sub(ur'\s+', u' ', brand.strip()) for brand in item['brand']]

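Note the u prefixes. In Python 2, you need a u prefix to produce a unicode literal rather than a str (encoded bytes) literal. It is important to use Unicode patterns with Unicode strings, even if the pattern itself doesn't care about any non-ASCII characters. (If all of this seems like a pointless pain and a source of bugs, well, it is; that is the main reason Python 3 exists.)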
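For comparison, a sketch of the same whitespace collapsing in Python 3, where every str literal is already Unicode and no u prefix is needed. The `brands` values are made up for illustration.

```python
import re

# Hypothetical values with whitespace runs inside the text, not just at the edges.
brands = ['\n   KORAL \t  ACTIVEWEAR\n  ', 'Boom\n\nLeggings']

# Strip the edges first, then replace every run of whitespace with one space.
collapsed = [re.sub(r'\s+', ' ', brand.strip()) for brand in brands]
print(collapsed)
```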

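As for retail_price, the same basic observations apply. Again, it is a list of strings, not just a string. And again, you probably don't need a regular expression. Assuming the price is always a $ (or some other single-character currency marker) followed by a number, just slice off the $ and call float or Decimal on the rest: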

item['retail_price'] = [float(price[1:]) for price in item['retail_price']]
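The Decimal variant mentioned above might look like the sketch below. Decimal is often preferred for money because float cannot represent most decimal fractions exactly. The `raw_prices` list is hypothetical and assumes every value starts with exactly one currency character.

```python
from decimal import Decimal

# Hypothetical prices, each with a single-character currency marker in front.
raw_prices = ['$12.50', '$0.00', '$199.99']

# Slice off the first character and parse the rest as an exact decimal.
prices = [Decimal(price[1:]) for price in raw_prices]
total = sum(prices)
print(total)
```

Note that the sample item in the question holds u'0.00' with no currency marker, in which case the slice would have to be dropped.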

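…but if you have examples that look different, with arbitrary extra characters on either side of the price, you can use re.search here. You will still need to map it over the list, and you should still use a Unicode pattern.

You also need to grab the matched group from the search result, and somehow handle empty or invalid strings (for which the search returns None, which you cannot convert to a float). You have to decide how to handle those, but from your attempt with filter, it looks like you just want to skip them. That is complicated enough that I would do it in multiple steps: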

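import re

prices = item['retail_price']
# search() returns None for strings that contain no price
matches = (re.search(ur'[0-9.]+', price) for price in prices)
# keep only the successful matches and pull out the matched text
groups = (match.group() for match in matches if match)
item['retail_price'] = map(float, groups)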

…or wrap that up in a function.
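Wrapped in a function, the steps above might look like the following sketch, written for Python 3. The name `parse_prices` and the pattern are illustrative choices, not part of the original answer; the pattern also accepts comma group separators, which the question raises as a concern, and assumes a comma is never used as the decimal mark.

```python
import re

# Matches digits with optional comma group separators and an optional
# decimal part, e.g. "1,299.50" or "0.00". Assumes "," is never the
# decimal mark.
PRICE_RE = re.compile(r'[0-9][0-9,]*(?:\.[0-9]+)?')


def parse_prices(values):
    """Return a list of floats, skipping strings that contain no price."""
    matches = (PRICE_RE.search(value) for value in values)
    groups = (match.group() for match in matches if match)
    # Drop the group separators before converting.
    return [float(group.replace(',', '')) for group in groups]


prices = parse_prices(['AUD 1,299.50', '  $0.00\n', '', 'n/a'])
print(prices)
```

Strings with no digits are silently skipped here, mirroring the filter attempt in the question; raising an error or substituting a default would be equally reasonable, depending on how the data is used downstream.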

Answer 2

You can define a method like the one below, which takes an object and returns it with all of its leaves normalized.

import six

def normalize(obj):
    if isinstance(obj, six.string_types):
        return ' '.join(obj.split())
    elif isinstance(obj, list):
        return [normalize(x) for x in obj]
    elif isinstance(obj, dict):
        return {k:normalize(v) for k,v in obj.items()}
    return obj

This is a recursive method that does not modify the original object but returns a normalized copy. You can also use it to normalize a plain string.
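The two claims above (the original is left untouched, and plain strings work too) can be checked with a small sketch. It redefines the function for Python 3, using str in place of six.string_types so that it runs without the six package; the `original` dict is sample data.

```python
def normalize(obj):
    """Python 3 variant of the function above: str replaces six.string_types."""
    if isinstance(obj, str):
        # split() with no arguments splits on any whitespace run and drops
        # leading/trailing whitespace, so join() yields single spaces.
        return ' '.join(obj.split())
    if isinstance(obj, list):
        return [normalize(x) for x in obj]
    if isinstance(obj, dict):
        return {k: normalize(v) for k, v in obj.items()}
    return obj


original = {'brand': ['\n   KORAL ACTIVEWEAR\n  '], 'store_id': 870}
cleaned = normalize(original)

print(cleaned)
print(original)  # still contains the untrimmed string
print(normalize('  Boom \t Leggings\n'))
```

Values that are not strings, lists, or dicts (such as the integer store_id, or a tuple) are returned unchanged.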

For your example item:

>> item = {'store_id': 870, 'sale_price_low': [], 'brand': [u'\n                KORAL ACTIVEWEAR\n              '], 'currency': 'AUD', 'retail_price': [u'0.00'], 'category': [u'Activewear'], 'title': [u'\n                Boom Leggings\n              '], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'sale_price_high': [], 'image_url': [u'  https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg\n'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'store': 'SampleStore'}

>> print (normalize(item))
{'category': [u'Activewear'], 'store_id': 870, 'sale_price_low': [], 'title': [u'Boom Leggings'], 'url': [u'/boom-leggings-koral-activewear/vp/v=1/1524019474.htm?folderID=13331&fm=other-shopbysize-viewall&os=false&colorId=68136'], 'brand': [u'KORAL ACTIVEWEAR'], 'currency': 'AUD', 'image_url': [u'https://images-na.sample-store.com/images/G/01/samplestore/p/prod/products/kacti/kacti3025868136/kacti3025868136_q1_2-0._SH20_QL90_UY365_.jpg'], 'category_link': 'https://www.samplestore.com/clothing-activewear/br/v=1/13331.htm?baseIndex=500', 'sale_price_high': [], 'retail_price': [u'0.00'], 'store': 'SampleStore'}

Python - Remove tab and new line in Object

scrapy

python-2.7

scrapy-pipeline