Regex split strings with numeric char and subsequent values in Python
Given this list of values:
['Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate unid']
How can I use re.split() to split this list into the following form?
['Champiñón' , '200 g',
'Zapallo italiano' , 'Unid.',
'Bolsa de zanahoria' ,'1 kg',
'Papa malla' ,'2 Kg',
'Palta Hass granel',
'Limón malla' ,'1 kg',
'Tomate granel',
'Brócoli' ,'1 un.',
'Tomate' ,'unid']
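In parsing situations, split() usually works best when you want to discard the text you are splitting on. Here you want to keep it, so you are better off using a capturing approach.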
import re
orig_vals = [
'Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate unid',
]
# We will capture the two parts of interest and
# only throw away a space in the middle. This regex is
# not super robust, but it does work correctly for the
# example data you have supplied.
rgx = re.compile(r'(.+) ((\d|unid).*)', re.IGNORECASE)
new_vals = []
for ov in orig_vals:
    m = rgx.search(ov)
    new_vals.extend([m.group(1).rstrip(), m.group(2)] if m else [ov])
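The same capturing idea can be wrapped in a small helper to try it on individual values. This is only a sketch: the name `split_name_unit` is hypothetical and not part of the original answer, and the pattern is the one shown above, so it carries the same robustness caveats.

```python
import re

# Same pattern as above, written as a raw string so that \d is not
# treated as a string escape sequence.
RGX = re.compile(r'(.+) ((\d|unid).*)', re.IGNORECASE)

def split_name_unit(text):
    """Return [name, unit] when a unit is found, otherwise [text]."""
    m = RGX.search(text)
    if m is None:
        return [text]
    # group(1) is the name, group(2) is the whole unit part.
    return [m.group(1).rstrip(), m.group(2)]

print(split_name_unit('Bolsa de zanahoria 1 kg'))  # ['Bolsa de zanahoria', '1 kg']
print(split_name_unit('Zapallo italiano Unid.'))   # ['Zapallo italiano', 'Unid.']
print(split_name_unit('Palta Hass granel'))        # ['Palta Hass granel']
```

The greedy `(.+)` backtracks until the space it gives up is followed by a digit or by "unid", which is why multi-word names stay in one piece.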
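If you really want to use split, you can write a more complex regex that uses a lookahead, which keeps the split from consuming (and therefore discarding) the text we are splitting on.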
rgx2 = re.compile(r'(.+?) +(?=\d|unid)', re.IGNORECASE)
new_vals2 = [
    part
    for ov in orig_vals
    for part in rgx2.split(ov)
    if part
]
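To see why the `if part` filter is there, it may help to look at what split() returns for a single value. The following is a minimal sketch using two sample strings from the list above. Because the pattern matches from the very start of the string, the piece "before" the match is an empty string.

```python
import re

rgx2 = re.compile(r'(.+?) +(?=\d|unid)', re.IGNORECASE)

# The match begins at index 0, so the first element is ''.
# The captured name comes next, then the unconsumed remainder.
raw_parts = rgx2.split('Papa malla 2 Kg')
print(raw_parts)  # ['', 'Papa malla', '2 Kg']

# No space is followed by a digit or "unid", so nothing is split.
unsplit = rgx2.split('Tomate granel')
print(unsplit)  # ['Tomate granel']
```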
You can do it like this:
import re
data = ['Champiñón 200 g',
'Zapallo italiano Unid.',
'Bolsa de zanahoria 1 kg',
'Papa malla 2 Kg',
'Palta Hass granel',
'Limón malla 1 kg',
'Tomate granel',
'Brócoli 1 un.',
'Tomate unid']
splitted = []
for line in data:
    # Pad with '' so the unpacking also works when nothing was split off.
    value, unit, *_ = *re.split(r' ((\d|unid).*)', line, flags=re.IGNORECASE), ''
    splitted.append(value)
    if unit:
        splitted.append(unit)
print(splitted)
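The starred unpacking relies on how re.split() behaves when the pattern contains capturing groups: the text of every group is included in the result. A short sketch on two of the sample values shows which extra items end up in `*_` and why the trailing `''` padding is needed.

```python
import re

pattern = r' ((\d|unid).*)'

# Outer group -> '1 kg', inner group -> '1', and the text after the match -> ''.
with_unit = re.split(pattern, 'Limón malla 1 kg', flags=re.IGNORECASE)
print(with_unit)  # ['Limón malla', '1 kg', '1', '']

# No match: split() returns the whole string as a single item.
no_unit = re.split(pattern, 'Tomate granel', flags=re.IGNORECASE)
print(no_unit)  # ['Tomate granel']

# Appending '' guarantees at least two items to unpack.
value, unit, *_ = *no_unit, ''
print(repr(value), repr(unit))  # 'Tomate granel' ''
```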