Finding a regex pattern to separate unformatted comma-separated values in a DataFrame column in Python
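Hi, I have to preprocess a column that holds comma-separated values. I can't simply apply .split(',\s*') because in some places a comma followed by a space should not split the value, so I am looking for a regex pattern.
Column: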
0 12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)
1 11 AM to 11 PM
2 11:30 AM to 4:30 PM, 6:30 PM to 11 PM
3 12 Noon to 2 AM
4 12noon to 11pm (Mon, Tue, Wed, Thu, Sun), 12no...
...
100 11 AM to 11 PM
101 10 AM to 10 PM (Mon-Thu), 8 AM to 10:30 PM (Fr...
102 12 Noon to 11 PM
103 8am to 12:30AM (Mon-Sun)
104 11:30 AM to 3 PM, 7 PM to 12 Midnight
What I tried:
import re
pattern = r'([\w+\:*\s*\w*(w{2})*]*\s*to\s*[\w+\:*\s*\w*(w{2})*]*\s*[\([a-zA-Z]*\-*\,*\s*[a-zA-Z]*\s*\)]*)'
timing = data['timings'].str.lower().str.split(pattern).dropna().to_numpy()
Output:
array([list(['12noon to 3:30pm,', ' 6:30pm to 11:30pm (mon-sun)', '']),
list(['11 am to 11 pm']),
list(['11:30 am to 4:30 pm, 6:30 pm to 11 pm']),
list(['12 noon to 2 am']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12noon to 3:30pm, 4pm to 6:30pm, 7pm to 11:30pm (mon, tue, wed, thu, sun), 12noon to 3:30pm, 4pm to 6:30pm,', ' 7pm to 12midnight (fri-sat)', '']),
list(['7 am to 10 pm']), list(['12 noon to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '10 am to 1 am (mon-thu)', ',', ' 10 am to 1:30 am (fri-sun)', '']),
list(['12 noon to 3:30 pm, 7 pm to 10:30 pm']),
list(['12 noon to 3:30 pm, 6:30 pm to 11:30 pm']),
list(['11:30 am to 1 am']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['12 noon to 4:30 pm, 6:30 pm to 11:30 pm']),
list(['11 am to 11 pm']), list(['12 noon to 10:30 pm']),
list(['11:30 am to 1 am']), list(['12 noon to 12 midnight']),
list(['12 noon to 11 pm']),
list(['', '12:30 pm to 10 pm (tue-sun)', ', mon closed']),
list(['11:30 am to 3 pm, 7 pm to 11 pm']),
list(['11am to 11:30pm (mon, tue, wed, thu, sun),', ' 11am to 12midnight (fri-sat)', '']),
list(['10 am to 5 am']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '12noon to 11pm (mon-thu)', ',', '12noon to 11:30pm (fri-sun)', '']),
list(['', '12 noon to 11:30 pm (mon-wed)', ',', ' 12 noon to 1 am (fri-sat)', ',', ' 12 noon to 12 midnight (sun)', ', thu closed']),
list(['12 noon to 4 pm, 6:30 pm to 11:30 pm']),
list(['10 am to 1 am']), list(['4:30 pm to 5:30 am']),
list(['11 am to 12 midnight']),
list(['12noon to 4pm,', ' 7pm to 12midnight (mon-sun)', '']),
list(['11 am to 12 midnight']),
list(['', '6am to 12midnight (mon-sun)', '']),
list(['12 noon to 11 pm']),
list(['12:30 pm to 3:30 pm, 7 pm to 10:40 pm']),
list(['12 noon to 4 pm, 7 pm to 11 pm']),
list(['12noon to 11pm (mon, tue, wed, thu, sun),', ' 12noon to 12midnight (fri-sat)', '']),
list(['12 noon to 10:30 pm']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['10 am to 10 pm']), list(['10 am to 10 pm']),
list(['7 am to 1 am']), list(['12 noon to 11:30 pm']),
list(['', '12noon to 11:30pm (mon-sun)', '']),
list(['12 noon to 11:30 pm']), list(['12 noon to 11 pm']),
list(['6 am to 10:30 pm']),
list(['11:30 am to 3:30 pm, 6:45 pm to 11:30 pm']),
list(['11:55 am to 4 pm, 7 pm to 11:15 pm']),
list(['12 noon to 11 pm']), list(['11 am to 11 pm']),
list(['12noon to 4:30pm, 6:30pm to 11:30pm (mon, tue, wed, fri, sat), closed (thu),', '12noon to 12midnight (sun)', '']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['8 am to 11:30 pm']),
list(['6:30am to 10:30am, 12:30pm to 3pm,', ' 7pm to 11pm (mon)', ',6:30am to 10:30am, 12:30pm to 3pm,', ' 7:30pm to 11pm (tue-sat)', ',6:30am to 10:30am, 12:30pm to 3:30pm,', ' 7pm to 11pm (sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11:30 pm']),
list(['11:30 am to 1 am']), list(['9 am to 10 pm']),
list(['12 noon to 12 midnight (mon-thu, sun),', ' 12 noon to 1 am (fri-sat)', '']),
list(['', '5pm to 12midnight (mon-sun)', '']),
list(['11 am to 11:30 pm']),
list(['', '11:30am to 11pm (mon-sun)', '']),
list(['12 noon to 10:30 pm']), list(['1 pm to 11 pm']),
list(['11:30 am to 12 midnight']),
list(['12 noon to 12 midnight']),
list(['', '12noon to 12midnight (mon-sun)', '']),
list(['', '12noon to 11pm (mon-sun)', '']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['12 noon to 3 pm, 7 pm to 11 pm']),
list(['', '11 am to 8 pm (mon-sat)', ', sun closed']),
list(['4 am to 12 midnight']), list(['9 am to 1 am']),
list(['10:30 am to 11 pm']), list(['7 am to 11 pm']),
list(['7 am to 10:30 am, 12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11:30 pm']),
list(['12 noon to 3:30 pm, 7 pm to 11 pm']),
list(['12noon to 12midnight (mon, tue, wed, thu, sun),', ' 12noon to 1am (fri-sat)', '']),
list(['', '11am to 11pm (mon-sun)', '']),
list(['6 am to 11:30 pm']), list(['11:30 am to 5 am']),
list(['12:30 pm to 3:30 pm, 7 pm to 11 pm']),
list(['', '6pm to 2am (mon-sun)', '']),......)
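But what I want is something like this:
[['6pm to 2am (mon-sun)'], ['12 noon to 12 midnight (mon-thu, sun)'] .....]
I think I need to design a better regex pattern to separate these values. Can anyone come up with a better one? Thanks in advance :)
Here is my attempt: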
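import re
import pandas
# Raw string, so the backslashes in the Windows path are not read as escapes
data = pandas.read_excel(r'C:\Users\Administrator\Desktop\test.xls')
pattern = r'(\d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) to \d{1,2}(?:\:\d{1,2})? ?(?:\w{2,8}) ?(?:\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?)'
# Join all rows into one string, then extract every time range from it
re.findall(pattern, data["myData"].str.cat(sep=", "))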
Calling re.findall() on the joined column gives this output:
['12noon to 3:30pm', '6:30pm to 11:30pm (Mon-Sun)', '11 AM to 11 PM', '11:30 AM to 4:30 PM', '6:30 PM to 11 PM', '12 Noon to 2 AM', '11 AM to 11 PM', '10 AM to 10 PM (Mon-Thu)', '8 AM to 10:30 PM (Fri,Sat)', '12 Noon to 11 PM', '8am to 12:30AM (Mon-Sun)', '11:30 AM to 3 PM', '7 PM to 12 Midnight']
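Note that joining the whole column into one string gives a single flat list, while the question asks for one list per row. A minimal sketch of a per-row variant follows, assuming the same pattern idea is acceptable. The helper name split_timings and the sample rows are only for illustration (the rows are copied from the question's output), and the optional space before the parenthesis is moved inside the optional group so matches carry no trailing space. Entries without a time range, such as "mon closed", are not matched by this pattern and are silently dropped.

```python
import re

# Same idea as the pattern above, written as raw strings and without the
# outer capturing group, so findall returns the whole match each time.
PATTERN = re.compile(
    r"\d{1,2}(?::\d{1,2})? ?\w{2,8} to "         # start time, e.g. "12noon", "11:30 am"
    r"\d{1,2}(?::\d{1,2})? ?\w{2,8}"             # end time, e.g. "12 midnight"
    r"(?: ?\(\w{3}(?:[ ,-]{1,3}\w{3}){0,6}\))?"  # optional days, e.g. "(mon-thu, sun)"
)


def split_timings(text):
    """Return the lowercased time ranges found in one cell (hypothetical helper)."""
    return PATTERN.findall(text.lower())


# Sample rows standing in for the real column
rows = [
    "12noon to 3:30pm, 6:30pm to 11:30pm (Mon-Sun)",
    "11 AM to 11 PM",
    "12 noon to 12 midnight (mon-thu, sun), 12 noon to 1 am (fri-sat)",
    "12:30 pm to 10 pm (tue-sun), mon closed",
]

# One list of time ranges per row
timing = [split_timings(row) for row in rows]
for entry in timing:
    print(entry)
```

With the DataFrame from the question, passing the same pattern string to data['timings'].str.lower().str.findall(PATTERN.pattern) should give one list per row in the same way, though I have not run it against the full data.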