Finding a regex pattern to separate unformatted comma-separated values in a DataFrame column in Python

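Hi, I have to preprocess a column that holds comma-separated values. I can't just apply .split(',\s*'), because in some places a comma followed by a space should not act as a separator, so I'm looking for a regex pattern.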
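To make the problem concrete, here is a minimal sketch of what the plain comma split does to one row. The sample string is reconstructed from the lowercased output shown further down, and the variable names are only illustrative.

```python
import re

# One row as it appears (lowercased) in the output further down.
sample = "12noon to 11pm (mon, tue, wed, thu, sun), 12noon to 12midnight (fri-sat)"

# Splitting on every comma also cuts the day list inside the parentheses.
naive = re.split(r",\s*", sample)
print(naive)
# ['12noon to 11pm (mon', 'tue', 'wed', 'thu', 'sun)', '12noon to 12midnight (fri-sat)']
```

Only the comma after the closing parenthesis is a real separator here; the four commas inside `(mon, tue, wed, thu, sun)` should be left alone.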

The column:

    0          12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
    1                                         11 AM to 11 PM
    2                  11:30 AM to 4:30 PM, 6:30 PM to 11 PM
    3                                        12 Noon to 2 AM
    4      12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
                         ...                        
    100                                       11 AM to 11 PM
    101    10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
    102                                     12 Noon to 11 PM
    103                             8am to 12:30AM (Mon-Sun)
    104                11:30 AM to 3 PM, 7 PM to 12 Midnight

What I tried:

    import re
    # Raw strings keep the backslashes literal; the pattern itself is unchanged.
    pattern = (r'([\w+\:*\s*\w*(w{2})*]*\s*to\s*[\w+\:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s* '
               r'[a-zA-Z]*\s*\)]*)')
    timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()

Output:

   array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
   list(['11 am to 11 pm']),
   list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
   list(['12 noon to 2 am']),
   list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
   list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
   list(['7 am to 10 pm']), list(['12 noon to 12 midnight']),
   list(['12 noon to 12 midnight']),
   list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
   list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
   list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
   list(['11:30 am to 1 am']),
   list(['', '12noon to 12midnight (mon-sun)', '']),
   list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
   list(['11 am to 11 pm']), list(['12 noon to 10:30 pm']),
   list(['11:30 am to 1 am']), list(['12 noon to 12 midnight']),
   list(['12 noon to 11 pm']),
   list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
   list(['11:30 am to 3 pm, 7 pm to 11 pm']),
   list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
   list(['10 am to 5 am']),
   list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
   list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
   list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
   list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
   list(['10 am to 1 am']), list(['4:30 pm to 5:30 am']),
   list(['11 am to 12 midnight']),
   list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
   list(['11 am to 12 midnight']),
   list(['', '6am to 12midnight (mon-sun)', '']),
   list(['12 noon to 11 pm']),
   list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
   list(['12 noon to 4 pm, 7 pm to 11 pm']),
   list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
   list(['12 noon to 10:30 pm']),
   list(['', '12noon to 11pm (mon-sun)', '']),
   list(['10 am to 10 pm']), list(['10 am to 10 pm']),
   list(['7 am to 1 am']), list(['12 noon to 11:30 pm']),
   list(['', '12noon to 11:30pm (mon-sun)', '']),
   list(['12 noon to 11:30 pm']), list(['12 noon to 11 pm']),
   list(['6 am to 10:30 pm']),
   list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
   list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
   list(['12 noon to 11 pm']), list(['11 am to 11 pm']),
   list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
   list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
   list(['8 am to 11:30 pm']),
   list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
   list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
   list(['11:30 am to 1 am']), list(['9 am to 10 pm']),
   list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
   list(['', '5pm to 12midnight (mon-sun)', '']),
   list(['11 am to 11:30 pm']),
   list(['', '11:30am to 11pm (mon-sun)', '']),
   list(['12 noon to 10:30 pm']), list(['1 pm to 11 pm']),
   list(['11:30 am to 12 midnight']),
   list(['12 noon to 12 midnight']),
   list(['', '12noon to 12midnight (mon-sun)', '']),
   list(['', '12noon to 11pm (mon-sun)', '']),
   list(['12 noon to 3 pm, 7 pm to 11 pm']),
   list(['12 noon to 3 pm, 7 pm to 11 pm']),
   list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
   list(['4 am to 12 midnight']), list(['9 am to 1 am']),
   list(['10:30 am to 11 pm']), list(['7 am to 11 pm']),
   list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
   list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
   list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
   list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
   list(['', '11am to 11pm (mon-sun)', '']),
   list(['6 am to 11:30 pm']), list(['11:30 am to 5 am']),
   list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
   list(['', '6pm to 2am (mon-sun)', '']),......)

But what I want is something like this:

    [['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'] .....]

I think I need to design a better regex pattern to separate these values. Can anyone come up with a better one? Thanks in advance :).

Here is my attempt:

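    import re
    import pandas

    # Raw strings: a plain '\U' in the path is a syntax error in Python 3.
    data = pandas.read_excel(r'C:\Users\Administrator\Desktop\test.xls')
    pattern = r'(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
    # Join all rows into one string, then extract every time slot from it.
    re.findall(pattern, data["myData"].str.cat(sep=", "))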

Calling re.findall(), my output is:

    ['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM', '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)', '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)', '11:30 AM to 3 PM', '7 PM to 12 Midnight']
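Since the question asks for one list per row, the same pattern could also be applied row by row instead of to the concatenated string. The sketch below does that on a few sample rows taken from the question. It also shows a possible alternative that is not part of the attempt above: splitting on commas that sit outside parentheses, using a negative lookahead. The names `SLOT`, `COMMA_OUTSIDE_PARENS`, and `rows` are hypothetical, and the lookahead assumes parentheses are never nested or left unclosed.

```python
import re

# Pattern copied from the attempt above, written as a raw string.
SLOT = re.compile(
    r"(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8})"
    r" ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)"
)

# Alternative (an assumption, not from the attempt above): a comma counts as
# a separator only if no ")" follows it before the next "(" or ")".
COMMA_OUTSIDE_PARENS = re.compile(r",\s*(?![^()]*\))")

# Sample rows taken from the question; real code would iterate over the column.
rows = [
    "12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)",
    "11 AM to 11 PM",
    "12 noon to 12 midnight (mon-thu, sun), 12 noon to 1 am (fri-sat)",
    "12:30 pm to 10 pm (tue-sun), mon closed",
]

by_findall = [SLOT.findall(row) for row in rows]            # one list per row
by_split = [COMMA_OUTSIDE_PARENS.split(row) for row in rows]

for found, pieces in zip(by_findall, by_split):
    print(found, pieces)
```

The two approaches agree on the first three rows. They differ on the last one: `findall` keeps only pieces that look like a time range, so `mon closed` is dropped, while the split keeps it as its own item. With the DataFrame from the question, something like `data['timings'].dropna().apply(SLOT.findall)` should give the per-row lists, assuming the column is named `timings` as in the question.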