RegEx for capturing dates with specific pattern
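I am trying to extract data from multiple PDF files. There is one date-related data point where the string preceding the date differs across some of the PDFs. I verified that each regex statement works on its own; however, when I try to combine the regex statements into a single statement inside my for loop, no date is extracted. Here are the strings I am trying to match, along with the individual regex statements I use to extract the date information after 'DATE OF BIRTHDAY':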
DATE OF BIRTHDAY\n01/11/2011
date_of_birthday1 = re.search('(?<=DATE OF BIRTHDAY \n)(.*)', img).groups()
DATE OF BIRTHDAY\n\n02/14/2015
date_of_birthday2 = re.search('(?<=DATE OF BIRTHDAY \n\n)(.*)', img).groups()
DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018
date_of_birthday3 = re.search('(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups()
I am trying to combine these regex statements into a single "or" statement so that I can use it in a for loop, like this:
date_of_birthdays = re.search('(?<=DATE OF BIRTHDAY\n\n)(.*)|(?<=DATE OF BIRTHDAY\n)(.*)|(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups
My expected output is
df['Birthdays'] = date_of_birthdays
which would look like this:
df = pd.DataFrame({"Birthdays": ['01/11/2011', '02/14/2015', '05/07/2018']})
df
However, I am not able to extract any date information. Any idea what I am doing wrong here?
This works:
>>> import re
>>> re.findall(
... r"(?:DATE[ ]OF[ ]BIRTHDAY)(?:\n(?:\n)?|[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ])?(.*)",
... (
... 'DATE OF BIRTHDAY\n01/11/2011' + "\n"
... 'DATE OF BIRTHDAY\n\n02/14/2015' + "\n"
... 'DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018' + "\n"
... ))
['01/11/2011', '02/14/2015', '05/07/2018']
>>>
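To use this inside a for loop over several PDFs, the same pattern can be compiled once and applied to each extracted text. This is only a minimal sketch: `extract_birthday` and `page_texts` are hypothetical names, and the three strings stand in for whatever text your PDF extraction step actually returns.

```python
import re

# Same pattern as above: the label, an optional separator, then the rest of the line.
DOB_PATTERN = re.compile(
    r"DATE[ ]OF[ ]BIRTHDAY"
    r"(?:\n(?:\n)?|[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ])?"
    r"(.*)"
)


def extract_birthday(text):
    """Return the text captured after the label, or None if the label is absent."""
    match = DOB_PATTERN.search(text)
    return match.group(1).strip() if match else None


# Hypothetical stand-ins for the text extracted from each PDF.
page_texts = [
    "DATE OF BIRTHDAY\n01/11/2011",
    "DATE OF BIRTHDAY\n\n02/14/2015",
    "DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
]

birthdays = [extract_birthday(text) for text in page_texts]
print(birthdays)
```

The resulting list can then be assigned to the column, for example `df['Birthdays'] = birthdays`. Returning None for a missing label keeps the list aligned with the input files.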
Regex expanded
 (?: DATE [ ] OF [ ] BIRTHDAY )
 (?:
      \n
      (?: \n )?
   |  [ ] GIRL [ ] \n\ni [ ] : [ ] Pll [ ] i [ ] ii \n i [ ] \n\n Pll [ ]
 )?
 ( .* )                        # (1)
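The expanded layout can be used directly in Python through the `re.VERBOSE` flag, which ignores unescaped whitespace in the pattern and allows `#` comments. That is also why every literal space is written as `[ ]`. A small sketch, where the variable names and the sample text are my own:

```python
import re

# re.VERBOSE ignores whitespace outside character classes,
# so literal spaces must be written as [ ].
pattern = re.compile(
    r"""
    (?: DATE [ ] OF [ ] BIRTHDAY )
    (?:
         \n
         (?: \n )?
      |  [ ] GIRL [ ] \n\ni [ ] : [ ] Pll [ ] i [ ] ii \n i [ ] \n\n Pll [ ]
    )?
    ( .* )                        # (1) rest of the line
    """,
    re.VERBOSE,
)

sample = (
    "DATE OF BIRTHDAY\n01/11/2011\n"
    "DATE OF BIRTHDAY\n\n02/14/2015\n"
    "DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018\n"
)

found = pattern.findall(sample)
print(found)
```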
Just a fair warning: the expression with lookbehind assertions
has a problem in these two alternations:
    (?<= DATE [ ] OF [ ] BIRTHDAY \n\n )
    ( .* )                        # (1)
 |  (?<= DATE [ ] OF [ ] BIRTHDAY \n )
    ( .* )                        # (2)
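It's hard to visualize, so I'll just say it outright:
when the label is followed by two newlines, re.search never returns the date in capture group 1 (the first alternation)!!
The reason is that the engine scans left to right, and the shorter lookbehind is already satisfied one position earlier, right after the first \n.
Since .* is allowed to match the empty string there, the second alternation, with its single \n literal, always matches first and captures nothing.
You can fix this by adding (?!\n) to the second alternation, which forces it not to match at that position: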
    (?<= DATE [ ] OF [ ] BIRTHDAY \n\n )
    ( .* )                        # (1)
 |  (?<= DATE [ ] OF [ ] BIRTHDAY \n )
    (?! \n )
    ( .* )                        # (2)
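The effect is easy to reproduce with `re.search` on the two-newline sample. The sketch below uses only the two alternations in question; the variable names are mine.

```python
import re

text = "DATE OF BIRTHDAY\n\n02/14/2015"

# Two alternations without the guard: the second one matches first, with an empty capture.
combined = r"(?<=DATE OF BIRTHDAY\n\n)(.*)|(?<=DATE OF BIRTHDAY\n)(.*)"

# Same expression with (?!\n) added to the second alternation.
guarded = r"(?<=DATE OF BIRTHDAY\n\n)(.*)|(?<=DATE OF BIRTHDAY\n)(?!\n)(.*)"

before = re.search(combined, text).groups()
after = re.search(guarded, text).groups()

print(before)  # (None, '')
print(after)   # ('02/14/2015', None)
```

Without the guard, the search succeeds between the two newlines, where group 2 captures an empty string. With the guard, that position is rejected and the first alternation captures the date one position later.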
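Well, this is way too problematic. Here are some benchmarks of
the approaches under consideration (which are not really the ideal way to do this):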
Regex1: (?:DATE[ ]OF[ ]BIRTHDAY)(?:\n(?:\n)?|[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ])?(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 0.29 s, 294.80 ms, 294801 µs
Matches per sec: 508,817
Regex2: (?:(?<=DATE[ ]OF[ ]BIRTHDAY\n\n)|(?<=DATE[ ]OF[ ]BIRTHDAY\n)(?!\n)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ]))(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 2.27 s, 2268.42 ms, 2268417 µs
Matches per sec: 66,125
Regex3: (?<=DATE[ ]OF[ ]BIRTHDAY\n\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY\n)(?!\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ])(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 2.76 s, 2760.81 ms, 2760809 µs
Matches per sec: 54,331
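If you want to repeat the comparison in Python itself, a rough sketch with the standard `timeit` module could look like this. The sample text, the helper `extract`, and the iteration count are my own assumptions, and Python's `re` engine will not reproduce the absolute numbers above; treat the output as a relative comparison on your machine only.

```python
import re
import timeit

TEXT = (
    "DATE OF BIRTHDAY\n01/11/2011\n"
    "DATE OF BIRTHDAY\n\n02/14/2015\n"
    "DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018\n"
)

LABEL = r"DATE[ ]OF[ ]BIRTHDAY"
GIRL = r"[ ]GIRL[ ]\n\ni[ ]:[ ]Pll[ ]i[ ]ii\ni[ ]\n\nPll[ ]"

# The three expressions from the benchmark, assembled from shared pieces.
PATTERNS = {
    "Regex1": re.compile(r"(?:" + LABEL + r")(?:\n(?:\n)?|" + GIRL + r")?(.*)"),
    "Regex2": re.compile(
        r"(?:(?<=" + LABEL + r"\n\n)|(?<=" + LABEL + r"\n)(?!\n)|(?<="
        + LABEL + GIRL + r"))(.*)"
    ),
    "Regex3": re.compile(
        r"(?<=" + LABEL + r"\n\n)(.*)|(?<=" + LABEL + r"\n)(?!\n)(.*)|(?<="
        + LABEL + GIRL + r")(.*)"
    ),
}


def extract(pattern, text):
    """Return the capture group that participated in each match."""
    return [m.group(m.lastindex) for m in pattern.finditer(text)]


results = {name: extract(p, TEXT) for name, p in PATTERNS.items()}

for name, p in PATTERNS.items():
    # Time 10,000 full scans of the sample text per pattern.
    seconds = timeit.timeit(lambda p=p: p.findall(TEXT), number=10_000)
    print(f"{name}: {results[name]} in {seconds:.3f} s")
```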