Extract data with regular expressions in Python

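I am trying to extract data from a txt file with Python (see the sample text below). Note that a title can sit on a single line, be split across two lines, or even be split by a blank line in the middle (TITLE1).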

What I want is to extract the information and store it in a table like this:

Code | Title | Opening date | Deadline | Budget
TITLE-SDFSD-DFDS-SFDS-01-01 | This is the title 1 that is split in two lines with a blank line in the middle | 15-Apr-21 | 26-Aug-21 | EUR 20.00 million
TITLE-SDFSD-DFDS-SFDS-01-02 | This is the title2 in one single line | 15-Mar-21 | 17-Aug-21 | EUR 15.00 million
TITLE-SDFSD-DFDS-SFDS-01-03 | This is the title3 that is too long and takes two lines | 15-May-21 | 26-Sep-21 | EUR 5.00 million

I managed to identify the code of each title with this snippet:

import re

with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read() 
    
pattern = re.compile(r'TITLE-.+-[0-9]{2}-[0-9]{2}(?!,)\S{1}')
matches = pattern.finditer(f_contents)

for match in matches:
    print(match)

I got this result:

<re.Match object; span=(160, 188), match='TITLE-SDFSD-DFDS-SFDS-01-01:'>
<re.Match object; span=(669, 697), match='TITLE-SDFSD-DFDS-SFDS-01-02;'>
<re.Match object; span=(1066, 1094), match='TITLE-SDFSD-DFDS-SFDS-01-03:'>

My question is how to take what I matched with the regular expression and extract the rest of the data. Can you help me?

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.

TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that

is split into two lines with a blank line in the middle

Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.

Opening date 15 Apr 2021

Deadline 26 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 20.00 million.

TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 March 2021

Deadline 17 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 15.00 million.

TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 May 2021

Deadline 26 Sep 2021

Indicative budget: The total indicative budget for the topic is EUR 5.00 million.

You can use capture groups to get the matched values.

Note that (?!,)\S can be written as [^\s,]
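As a quick check of that note, here is a minimal sketch that runs both spellings against two header lines from the sample text and reads the code through capture group 1. The third input line (ending in a comma) and the helper name first_group are made up for illustration only.

```python
import re

# Two header lines taken from the sample text, plus one hypothetical
# line whose code is followed by a comma (neither pattern should accept it).
lines = [
    "TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that",
    "TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line",
    "TITLE-SDFSD-DFDS-SFDS-01-09, hypothetical line",
]

# Same pattern written two ways; the parentheses add capture group 1.
with_lookahead = re.compile(r"(TITLE-.+-[0-9]{2}-[0-9]{2})(?!,)\S")
with_char_class = re.compile(r"(TITLE-.+-[0-9]{2}-[0-9]{2})[^\s,]")


def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


codes_a = [first_group(with_lookahead, s) for s in lines]
codes_b = [first_group(with_char_class, s) for s in lines]
print(codes_a)
print(codes_b)
```

Both lists should be identical: the code without its trailing ':' or ';' for the first two lines, and None for the comma case.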

Based on the lines in the example:

^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] (.*(?:\r?\n(?![A-Z]).*)*)(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?(EUR \d+(?:\.\d+)? \w+)

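Explanation

  • ^ Start of a line (the pattern is compiled with re.MULTILINE)
  • (TITLE-.+?-[0-9]{2}-[0-9]{2}) Capture group 1, match the code part
  • [^\s,] Match a single non-whitespace character other than a comma
  • (.*(?:\r?\n(?![A-Z]).*)*) Capture group 2, match the rest of the line and all following lines that do not start with an uppercase character
  • (?:\r?\n(?!Opening).*)*\r?\nOpening date Match all lines up to Opening date
  • (\d+ .*) Capture group 3, match 1+ digits, a space and the rest of the line
  • (?:\r?\n(?!Deadline).*)*\r?\nDeadline Match all lines up to Deadline
  • (\d+ .*) Capture group 4, match 1+ digits, a space and the rest of the line
  • (?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*? Match all lines up to Indicative budget:, then as few characters as possible
  • (EUR \d+(?:\.\d+)? \w+) Capture group 5, match EUR, a number with an optional decimal part, and 1+ word characters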
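To make the breakdown easier to follow, the same pattern can be split into commented pieces and tried on one record. This is only a sketch: the named groups (code, title, opening, deadline, budget) are a naming choice made here, not part of the pattern above, and the Conditions line in the sample is shortened.

```python
import re

# The pattern from above, split into adjacent raw strings (Python joins them).
pattern = re.compile(
    r"^(?P<code>TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] "        # code, then ':' or ';' and a space
    r"(?P<title>.*(?:\r?\n(?![A-Z]).*)*)"                    # title, including continuation lines
    r"(?:\r?\n(?!Opening).*)*\r?\nOpening date (?P<opening>\d+ .*)"
    r"(?:\r?\n(?!Deadline).*)*\r?\nDeadline (?P<deadline>\d+ .*)"
    r"(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?"
    r"(?P<budget>EUR \d+(?:\.\d+)? \w+)",
    re.MULTILINE,
)

# First record of the sample text (Conditions paragraph shortened).
sample = (
    "TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that\n"
    "\n"
    "is split into two lines with a blank line in the middle\n"
    "\n"
    "Conditions Pellentesque blandit scelerisque pellentesque.\n"
    "\n"
    "Opening date 15 Apr 2021\n"
    "\n"
    "Deadline 26 Aug 2021\n"
    "\n"
    "Indicative budget: The total indicative budget for the topic is EUR 20.00 million.\n"
)

m = pattern.search(sample)
record = m.groupdict()
# The title group still contains the line breaks; collapse them to single spaces.
record["title"] = " ".join(record["title"].split())
print(record)
```

Named groups are optional; with the numbered groups of the original pattern, m.group(1) to m.group(5) give the same values.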


You can then load the matches into a table or a dataframe:

import re

import pandas as pd

with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

pattern = re.compile(r"^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] (.*(?:\r?\n(?![A-Z]).*)*)(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?(EUR \d+(?:\.\d+)? \w+)", re.MULTILINE)
matches = pattern.findall(f_contents)  # one 5-tuple per record
df = pd.DataFrame(matches, columns=['Code', 'Title', 'Opening date', 'Deadline', 'Budget'])
# Join multi-line titles; regex=True is needed because newer pandas versions treat the pattern as literal text by default
df['Title'] = df['Title'].str.replace(r'[\r\n]+', ' ', regex=True).str.strip()
print(df)

Output

            Code          Title   Opening date     Deadline         Budget
0  TITLE-SDFS...  This is th...    15 Apr 2021  26 Aug 2021  EUR 20.00 ...
1  TITLE-SDFS...  This is th...  15 March 2021  17 Aug 2021  EUR 15.00 ...
2  TITLE-SDFS...  This is th...    15 May 2021  26 Sep 2021  EUR 5.00 m...
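The table in the question shows dates as 15-Apr-21, while the captured values look like 15 Apr 2021 or 15 March 2021. A possible follow-up step, sketched below, converts them with datetime. The helper name normalize_date is hypothetical, and the month names assume the default English (C) locale.

```python
from datetime import datetime


def normalize_date(text):
    """Convert '15 Apr 2021' or '15 March 2021' to the '15-Apr-21' style."""
    # Try the abbreviated month name first, then the full month name.
    for fmt in ("%d %b %Y", "%d %B %Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%d-%b-%y")
        except ValueError:
            continue
    return text  # leave values that match neither format untouched


dates = ["15 Apr 2021", "15 March 2021", "26 Sep 2021"]
normalized = [normalize_date(d) for d in dates]
print(normalized)
```

With a dataframe like the one above, the same helper could be applied per column, for example df['Opening date'].map(normalize_date).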

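Use a regular expression with capture groups. The re.DOTALL flag lets .* match across line breaks, so you can capture multi-line titles, and lazy quantifiers keep each match from running too far.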

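import csv
import re

# Read the input first; f_contents holds the full text of the file
with open('doubt2.txt', 'r', encoding='utf-8') as f:
    f_contents = f.read()

pattern = re.compile(r'^(TITLE-.+?-\d{2}-\d{2})\S*\s*(.*?)^Conditions.*?^Opening date (\d{1,2} \w+ \d{4})\s*?^Deadline (\d{1,2} \w+ \d{4})\s*^Indicative budget:.*?(EUR [\d.]+ \w+)', re.MULTILINE | re.DOTALL)
matches = pattern.finditer(f_contents)

# newline='' is what the csv module expects; it avoids blank rows on Windows
with open("result.csv", "w", newline='', encoding='utf-8') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Code', 'Title', 'Opening date', 'Deadline', 'Budget'])
    for match in matches:
        # Collapse line breaks and trailing whitespace in a multi-line title
        title = ' '.join(match.group(2).split())
        csvfile.writerow([match.group(1), title, match.group(3), match.group(4), match.group(5)])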
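To see what re.DOTALL and the lazy quantifier each contribute, here is a small sketch on a made-up miniature input (the CODE: prefix and the two Conditions lines are hypothetical, chosen only to keep the example short).

```python
import re

# Hypothetical miniature input: a title split over two lines, followed by
# two lines that both start with "Conditions".
text = (
    "CODE: first half\n"
    "second half\n"
    "Conditions one\n"
    "Conditions two\n"
)

# Without DOTALL, '.' cannot cross a line break, so the title cannot reach
# the line that starts with "Conditions" and the search fails.
no_dotall = re.search(r"^CODE: (.*?)^Conditions", text, re.MULTILINE)

# With DOTALL, the lazy '.*?' stops at the first "Conditions" line ...
lazy = re.search(r"^CODE: (.*?)^Conditions", text, re.MULTILINE | re.DOTALL)

# ... while the greedy '.*' runs on to the last one.
greedy = re.search(r"^CODE: (.*)^Conditions", text, re.MULTILINE | re.DOTALL)

print(no_dotall)
print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

On a file with several records, the greedy form would swallow every record up to the last Conditions line, which is why the pattern above uses .*? throughout.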
