Regex split strings with numeric char and subsequent values in Python

Given this list of values:

['Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate  unid']

How can I use re.split() to split this list so that it takes this form:

['Champiñón' , '200 g',
'Zapallo italiano' , 'Unid.',
'Bolsa de zanahoria' ,'1 kg',
'Papa malla' ,'2 Kg',
'Palta Hass granel',
'Limón malla' ,'1 kg',
'Tomate granel',
'Brócoli' ,'1 un.',
'Tomate'  ,'unid']

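When parsing, split() usually works best when you want to discard the text you are splitting on. But you want to keep it, so you are better off with a capturing approach.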

import re

orig_vals = [
    'Champiñón 200 g',
    'Zapallo italiano Unid.',
    'Bolsa de zanahoria 1 kg',
    'Papa malla 2 Kg',
    'Palta Hass granel',
    'Limón malla 1 kg',
    'Tomate granel',
    'Brócoli 1 un.',
    'Tomate  unid',
]

# We will capture the two parts of interest and
# only throw away a space in the middle. This regex is
# not super robust, but it does work correctly for the
# example data you have supplied.
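# Raw string, so that \d reaches the regex engine instead of being
# treated as an (invalid) string escape.
rgx = re.compile(r'(.+) ((\d|unid).*)', re.IGNORECASE)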

new_vals = []
for ov in orig_vals:
    m = rgx.search(ov)
    new_vals.extend([m.group(1).rstrip(), m.group(2)] if m else [ov])
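
To make the point about split() discarding what it matches more concrete, here is a minimal sketch on one of your strings. The variable names (sample, lost, kept) are just for illustration, and the inner group is written as non-capturing, which is a small variation on the pattern above.

```python
import re

sample = 'Bolsa de zanahoria 1 kg'

# Splitting on " <digit>" consumes the digit: the "1" is gone from the result.
lost = re.split(r' \d', sample)

# Capturing both halves with search() keeps everything except the
# single separating space.
m = re.search(r'(.+) ((?:\d|unid).*)', sample, re.IGNORECASE)
kept = [m.group(1), m.group(2)]

print(lost)  # ['Bolsa de zanahoria', ' kg']
print(kept)  # ['Bolsa de zanahoria', '1 kg']
```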

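If you really want to use split, you can write a more complex regex with a lookahead, which keeps the pattern from consuming (and therefore discarding) the text we are splitting on.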

# The lookahead (?=...) checks for a digit or "unid" without consuming it.
rgx2 = re.compile(r'(.+?) +(?=\d|unid)', re.IGNORECASE)

new_vals2 = [
    part
    for ov in orig_vals
    for part in rgx2.split(ov)
    if part
]
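
In case the `if part` filter looks odd: when the pattern contains a capturing group, re.split() also returns the captured text, and a match at the very start of the string leaves an empty string in front. A small sketch of what the raw split returns (raw_split and no_match are illustrative names only):

```python
import re

rgx2 = re.compile(r'(.+?) +(?=\d|unid)', re.IGNORECASE)

# The match starts at index 0, so the piece "before" it is an empty string;
# the captured name comes next, then the untouched remainder.
raw_split = rgx2.split('Champiñón 200 g')

# With no digit or "unid" to look ahead to, nothing matches and the
# string comes back as a single item.
no_match = rgx2.split('Tomate granel')

print(raw_split)  # ['', 'Champiñón', '200 g']
print(no_match)   # ['Tomate granel']
```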

You can do it like this:

import re

data = ['Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate  unid']


splitted = []

for line in data:
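    # ' +' (rather than a single space) also swallows the double space in
    # 'Tomate  unid', so value carries no trailing blank. The trailing ''
    # pads the unpacking when there is no match and split() returns one item.
    value, unit, *_ = *re.split(r' +((\d|unid).*)', line, flags=re.IGNORECASE), ''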

    splitted.append(value)

    if unit:
        splitted.append(unit)

print(splitted)
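
If the star-unpacking feels too clever, the same idea can be wrapped in a small function. This is only a sketch: split_item and UNIT_RGX are made-up names, and the pattern is a variant of the one above with a non-capturing inner group so that split() returns no extra item. It has only been reasoned through against the sample list, so a product name that itself contains a digit or a word starting with "unid" would be split too early.

```python
import re

# Hypothetical names; (?:...) groups without capturing, so split()
# returns [value, unit, ''] on a match instead of four items.
UNIT_RGX = re.compile(r' +((?:\d|unid).*)', re.IGNORECASE)


def split_item(line):
    """Return [value, unit] if a quantity/unit is found, else [line]."""
    # Dropping empty strings handles both the trailing '' after a match
    # and the single-item result when nothing matches.
    return [part for part in UNIT_RGX.split(line, maxsplit=1) if part]


data = [
    'Champiñón 200 g',
    'Zapallo italiano Unid.',
    'Bolsa de zanahoria 1 kg',
    'Papa malla 2 Kg',
    'Palta Hass granel',
    'Limón malla 1 kg',
    'Tomate granel',
    'Brócoli 1 un.',
    'Tomate  unid',
]

result = [part for line in data for part in split_item(line)]
print(result)
```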