Group naming with group and nested regex (unit conversion from text file)

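Basic question:

How can I name a Python regex group using the value of another group, and nest it inside a larger regex group?

Origin of the question:

Given a string such as 'Your favorite song is 1 hour 23 seconds long. My phone only records for 1 h 30 mins and 10 secs.'

what is an elegant way to extract the times and convert them to a given unit?

Attempted solution:

My best guess at a solution is to build a dictionary and then operate on that dictionary to convert to the desired unit.

That is, convert the given strings into: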

string[0]:
 {'time1': {'day':0, 'hour':1, 'minutes':0, 'seconds':23, 'milliseconds':0}, 'time2': {'day':0, 'hour':1, 'minutes':30, 'seconds':10, 'milliseconds':0}}

string[1]:
 {'time1': {'day':4, 'hour':2, 'minutes':3, 'seconds':6, 'milliseconds':30}}
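
For reference, the conversion step on one of these dictionaries might look roughly like the sketch below. The `SECONDS_PER` table and the `to_unit` name are illustrative assumptions, not part of any existing code; the keys simply mirror the dictionaries shown above.

```python
# Seconds per unit; the key names follow the dictionaries above.
# SECONDS_PER and to_unit are hypothetical names used only for this sketch.
SECONDS_PER = {
    'day': 86400,
    'hour': 3600,
    'minutes': 60,
    'seconds': 1,
    'milliseconds': 0.001,
}

def to_unit(time_dict, unit='seconds'):
    """Convert one time dictionary to a single number expressed in `unit`."""
    total_seconds = sum(SECONDS_PER[name] * value
                        for name, value in time_dict.items())
    return total_seconds / SECONDS_PER[unit]

time1 = {'day': 0, 'hour': 1, 'minutes': 0, 'seconds': 23, 'milliseconds': 0}
print(to_unit(time1, 'seconds'))  # 3623.0
```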

I have a regex solution, but it doesn't behave the way I want:

import re

test_string = ['Your favorite song is 1 hour 23 seconds long.  My phone only records for 1h 30 mins and 10 secs.',
                'This video is 4 days 2h 3min 6sec 30ms']

year_units = ['year', 'years', 'y']
day_units = ['day', 'days', 'd']
hour_units = ['hour', 'hours', 'h']
min_units = ['minute', 'minutes', 'min', 'mins', 'm']
sec_units = ['second', 'seconds', 'sec', 'secs', 's']
millisec_units = ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms']
all_units = '|'.join(year_units + day_units + hour_units + min_units + sec_units + millisec_units)
print((all_units))

# pattern = r"""(?P<time>               # time group beginning
#               (?P<value>[\d]+)    # value of time unit
#               \s*                 # may or may not be space between digit and unit
#               (?P<unit>%s)        # unit measurement of time
#               \s*                 # may or may not be space between digit and unit
#           )
#           \w+""" % all_units
pattern = r""".*(?P<time>       # time group beginning
            (?P<value>[\d]+)    # value of time unit
            \s*                 # may or may not be space between digit and unit
            (?P<unit>%s)        # unit measurement of time
            \s*                 # may or may not be space between digit and unit
            ).*                 # may be words in between the times 
            """ % (all_units)

regex = re.compile(pattern)
for val in test_string:
    match = regex.search(val)
    print(match)
    print(match.groupdict())

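It fails miserably because it can't handle the nested grouping properly and can't assign a group name from another group's value.

First of all, you can't just write a multi-line regex with comments and expect it to match anything unless you use the re.VERBOSE flag: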

regex = re.compile(pattern, re.VERBOSE)
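
A small sketch of the difference, using a deliberately tiny unit list (`h|min`) that is only an assumption for illustration: without the flag, the whitespace and the `#` comments are treated as literal characters to match.

```python
import re

pattern = r"""(?P<value>\d+)   # the number
              \s*              # optional whitespace
              (?P<unit>h|min)  # a tiny, illustrative unit list
           """

plain = re.compile(pattern)                # whitespace and comments are literal
verbose = re.compile(pattern, re.VERBOSE)  # whitespace and comments are ignored

print(plain.search('1 h 30 min'))                # None
print(verbose.search('1 h 30 min').groupdict())  # {'value': '1', 'unit': 'h'}
```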

As you said, the best solution is probably to use a dictionary:

for val in test_string:
    while True: #find all times
        match = regex.search(val) #find the first unit
        if not match:
            break
        matches= {} # keep track of all units and their values
        while True:
            matches[match.group('unit')]= int(match.group('value')) # add the match to the dict
            val= val[match.end():] # remove part of the string so subsequent matches must start at index 0
            m= regex.search(val)
            if not m or m.start()!=0: # if there are no more matches or there's text between this match and the next, abort
                break
            match= m
        print(matches) # the finished dict

# output will be like {'h': 1, 'secs': 10, 'mins': 30}

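However, the code above won't work yet. We need to make two adjustments:

  • The pattern must not allow just any text between matches. To allow only whitespace and the word "and" between two matches, you can use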

    pattern = r"""(?P<time>         # time group beginning
                (?P<value>[\d]+)    # value of time unit
                \s*                 # may or may not be space between digit and unit
                (?P<unit>%s)        # unit measurement of time
                \s*                 # may or may not be space between digit and unit
                (?:\band\s+)?       # allow the word "and" between numbers
                )                   # may be words in between the times
                """ % (all_units)

  • You have to change the order of the units like this:

    year_units = ['years', 'year', 'y']  # yearS before year
    day_units = ['days', 'day', 'd']     # dayS before day, etc...

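    Why? Regex alternation tries the alternatives from left to right, so with text like 3 years and 1 day the pattern would otherwise match 3 year instead of 3 years and.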
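
Putting both adjustments together, one possible sketch is shown below. Instead of reordering each list by hand, it sorts all spellings longest-first, which also puts 'ms' ahead of 'm'. The names `UNITS`, `CANONICAL`, `ALTERNATION`, `TIME_RE` and `extract_times` are assumptions made up for this example, as are the `\b` after the unit and the use of `finditer` in place of the string-slicing loop; treat it as a starting point rather than a definitive implementation.

```python
import re

# Unit spellings taken from the question, keyed by a canonical name.
UNITS = {
    'year': ['year', 'years', 'y'],
    'day': ['day', 'days', 'd'],
    'hour': ['hour', 'hours', 'h'],
    'minutes': ['minute', 'minutes', 'min', 'mins', 'm'],
    'seconds': ['second', 'seconds', 'sec', 'secs', 's'],
    'milliseconds': ['millisecond', 'milliseconds', 'millisec', 'millisecs', 'ms'],
}
# Map every spelling to its canonical name, e.g. 'secs' -> 'seconds'.
CANONICAL = {alias: name for name, aliases in UNITS.items() for alias in aliases}
# Longest spellings first, so 'years' is tried before 'year' and 'ms' before 'm'.
ALTERNATION = '|'.join(sorted(map(re.escape, CANONICAL), key=len, reverse=True))

TIME_RE = re.compile(r"""
    (?P<value>\d+)          # value of the time unit
    \s*                     # optional space between digit and unit
    (?P<unit>%s)\b          # unit, ending at a word boundary (an added safeguard)
    \s*                     # optional trailing space
    (?:and\s+)?             # allow the word "and" between components
    """ % ALTERNATION, re.VERBOSE)

def extract_times(text):
    """Return one dict per run of adjacent value/unit pairs found in text."""
    times, current, previous_end = [], None, None
    for match in TIME_RE.finditer(text):
        if current is None or match.start() != previous_end:
            current = {}            # a gap in the text starts a new time
            times.append(current)
        current[CANONICAL[match.group('unit')]] = int(match.group('value'))
        previous_end = match.end()
    return times

print(extract_times('This video is 4 days 2h 3min 6sec 30ms'))
# [{'day': 4, 'hour': 2, 'minutes': 3, 'seconds': 6, 'milliseconds': 30}]
```

Mapping each spelling to a canonical key means the resulting dictionaries use the same keys regardless of whether the text said 'h', 'hour' or 'hours', which makes a later unit conversion simpler.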