Extract data with regular expressions in Python
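I am trying to extract data from a txt file with Python (see the sample text below). Note that a title can sit on a single line, be split across two lines, or even be split by a blank line in the middle (TITLE1).
What I want is to extract the information and store it in a table like this: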
Code | Title | Opening date | Deadline | Budget |
---|---|---|---|---|
TITLE-SDFSD-DFDS-SFDS-01-01 | This is the title 1 that is split in two lines with a blank line in the middle | 15-Apr-21 | 26-Aug-21 | EUR 20.00 million |
TITLE-SDFSD-DFDS-SFDS-01-02 | This is the title2 in one single line | 15-Mar-21 | 17-Aug-21 | EUR 15.00 million |
TITLE-SDFSD-DFDS-SFDS-01-03 | This is the title3 that is too long and takes two lines | 15-May-21 | 26-Sep-21 | EUR 5.00 million |
I managed to identify the code part of each title with this code:
import re

with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

pattern = re.compile(r'TITLE-.+-[0-9]{2}-[0-9]{2}(?!,)\S{1}')
matches = pattern.finditer(f_contents)
for match in matches:
    print(match)
I got this result:
<re.Match object; span=(160, 188), match='TITLE-SDFSD-DFDS-SFDS-01-01:'>
<re.Match object; span=(669, 697), match='TITLE-SDFSD-DFDS-SFDS-01-02;'>
<re.Match object; span=(1066, 1094), match='TITLE-SDFSD-DFDS-SFDS-01-03:'>
My question is how to take what I have identified with the regex and extract the rest of the data. Can you help me?
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.
TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that
is split into two lines with a blank line in the middle
Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR 20.00 million.
TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line
Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.
Opening date 15 March 2021
Deadline 17 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR 15.00 million.
TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines
Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.
Opening date 15 May 2021
Deadline 26 Sep 2021
Indicative budget: The total indicative budget for the topic is EUR 5.00 million.
You can get the individual values by using capture groups.
Note that (?!,)\S can be written more simply as [^\s,]
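As a quick check of that equivalence, here is a minimal sketch that runs both spellings against a handful of single characters. The characters are picked for illustration only: the two separators that follow the codes in the sample text, plus three that both forms should reject.

```python
import re

# Two spellings of "one non-whitespace character that is not a comma".
lookahead_form = re.compile(r"(?!,)\S")
char_class_form = re.compile(r"[^\s,]")

# Illustrative inputs: ':' and ';' follow the codes in the sample text,
# while ',', ' ' and a newline should be rejected by both forms.
samples = [":", ";", ",", " ", "\n"]

results = {
    ch: (bool(lookahead_form.fullmatch(ch)), bool(char_class_form.fullmatch(ch)))
    for ch in samples
}

for ch, (with_lookahead, with_class) in results.items():
    print(repr(ch), with_lookahead, with_class)
```

Both columns agree for every character tried here, which is why the character class is the simpler choice.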
Based on the lines in the example:
^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] (.*(?:\r?\n(?![A-Z]).*)*)(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?(EUR \d+(?:\.\d+)? \w+)
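Explanation
- `^` Start of string (with re.MULTILINE, as used below, also the start of each line)
- `(TITLE-.+?-[0-9]{2}-[0-9]{2})` Capture group 1, match the code part of the title
- `[^\s,]` Match a single non-whitespace character other than a comma
- `(.*(?:\r?\n(?![A-Z]).*)*)` Capture group 2, match the rest of the line and all following lines that do not start with an uppercase character
- `(?:\r?\n(?!Opening).*)*\r?\nOpening date ` Match all lines up to the one that starts with Opening date
- `(\d+ .*)` Capture group 3, match 1+ digits, a space and the rest of the line
- `(?:\r?\n(?!Deadline).*)*\r?\nDeadline ` Match all lines up to the one that starts with Deadline
- `(\d+ .*)` Capture group 4, match 1+ digits, a space and the rest of the line
- `(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?` Match all lines up to the one that starts with Indicative budget:, then as few characters as possible
- `(EUR \d+(?:\.\d+)? \w+)` Capture group 5, match EUR, a number with an optional decimal part, and 1+ word characters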
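To see the five groups in action, the breakdown above can be tried on a small string. This is only a sketch: the `text` below is a shortened stand-in for the file contents, and the pattern is split across several raw-string pieces purely for readability.

```python
import re

# Shortened stand-in for the file contents, laid out like the sample text.
text = """Lorem ipsum dolor sit amet.
TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that
is split into two lines
Conditions Pellentesque blandit scelerisque pellentesque.
Opening date 15 Apr 2021
Deadline 26 Aug 2021
Indicative budget: The total indicative budget for the topic is EUR 20.00 million.
"""

# Same pattern as above; adjacent string literals are concatenated by Python.
pattern = re.compile(
    r"^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] "                   # group 1: code
    r"(.*(?:\r?\n(?![A-Z]).*)*)"                               # group 2: title
    r"(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)"       # group 3
    r"(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)"          # group 4
    r"(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?"
    r"(EUR \d+(?:\.\d+)? \w+)",                                # group 5: budget
    re.MULTILINE,
)

match = pattern.search(text)
code, title, opening, deadline, budget = match.groups()
print(code, repr(title), opening, deadline, budget, sep="\n")
```

Note that group 2 still contains the line break inside the title, so it needs to be replaced with a space before the value goes into a table.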
You can then load the result into a table or a DataFrame:
import re
import pandas as pd

with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

pattern = re.compile(r"^(TITLE-.+?-[0-9]{2}-[0-9]{2})[^\s,] (.*(?:\r?\n(?![A-Z]).*)*)(?:\r?\n(?!Opening).*)*\r?\nOpening date (\d+ .*)(?:\r?\n(?!Deadline).*)*\r?\nDeadline (\d+ .*)(?:\r?\n(?!Indicative budget:).*)*\r?\nIndicative budget: .*?(EUR \d+(?:\.\d+)? \w+)", re.MULTILINE)
matches = pattern.findall(f_contents)
df = pd.DataFrame(matches, columns=['Code', 'Title', 'Opening date', 'Deadline', 'Budget'])
# Join multi-line titles; regex=True is needed because pandas 2.0+ treats the pattern as a literal by default
df['Title'] = df['Title'].str.replace(r'[\r\n]+', ' ', regex=True)
print(df)
Output
Code Title Opening date Deadline Budget
0 TITLE-SDFS... This is th... 15 Apr 2021 26 Aug 2021 EUR 20.00 ...
1 TITLE-SDFS... This is th... 15 March 2021 17 Aug 2021 EUR 15.00 ...
2 TITLE-SDFS... This is th... 15 May 2021 26 Sep 2021 EUR 5.00 m...
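The dates come out as written in the file (15 Apr 2021, 15 March 2021), while the table in the question uses the form 15-Apr-21. A minimal sketch of one way to convert them follows. The helper name `to_short_date` is hypothetical, and it assumes English month names (abbreviated or full) and an English or C locale for `strptime`.

```python
from datetime import datetime

# Assumed input formats: abbreviated ("15 Apr 2021") or full ("15 March 2021") month names.
INPUT_FORMATS = ("%d %b %Y", "%d %B %Y")


def to_short_date(value: str) -> str:
    """Convert e.g. '15 March 2021' to '15-Mar-21' (hypothetical helper)."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
        return parsed.strftime("%d-%b-%y")
    raise ValueError(f"unrecognised date: {value!r}")


dates = ["15 Apr 2021", "15 March 2021", "26 Sep 2021"]
converted = [to_short_date(d) for d in dates]
print(converted)
```

With the DataFrame above, the same function could be applied per column, for example `df['Opening date'].map(to_short_date)`.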
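Use a regex with capture groups. The re.DOTALL flag allows .* to match across line breaks, so you can capture multi-line titles. Use lazy quantifiers so the matches do not run too long.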
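The effect of re.DOTALL and of lazy versus greedy quantifiers can be seen on a tiny example. The `text` here is made up for illustration and is much simpler than the real file.

```python
import re

# Made-up text: a two-line title followed by two lines starting with "Conditions".
text = "TITLE-01: first part\nsecond part\nConditions a\nConditions b\n"

# Without DOTALL, "." cannot cross the line break inside the title, so nothing matches.
no_dotall = re.search(r"TITLE-01: (.*?)\nConditions", text)

# With DOTALL and a lazy quantifier, the match stops at the first "Conditions".
lazy = re.search(r"TITLE-01: (.*?)\nConditions", text, re.DOTALL)

# With DOTALL and a greedy quantifier, the match runs on to the last "Conditions".
greedy = re.search(r"TITLE-01: (.*)\nConditions", text, re.DOTALL)

print(no_dotall)
print(repr(lazy.group(1)))
print(repr(greedy.group(1)))
```

The greedy version swallows the first Conditions line into the title, which is the kind of over-long match the lazy quantifiers are there to prevent.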
import csv
import re

# Read the whole file so the pattern can span several lines
with open('doubt2.txt', 'r', encoding="utf-8") as f:
    f_contents = f.read()

pattern = re.compile(r'^(TITLE-.+?-\d{2}-\d{2})\S*\s*(.*?)^Conditions.*?^Opening date (\d{1,2} \w+ \d{4})\s*?^Deadline (\d{1,2} \w+ \d{4})\s*^Indicative budget:.*?(EUR [\d.]+ \w+)', re.MULTILINE | re.DOTALL)
matches = pattern.finditer(f_contents)

# newline="" lets the csv module handle line endings itself
with open("result.csv", "w", newline="", encoding="utf-8") as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Code', 'Title', 'Opening date', 'Deadline', 'Budget'])
    for match in matches:
        # Put a multi-line title on a single line
        csvfile.writerow([match.group(1), match.group(2).replace('\n', ' '), match.group(3), match.group(4), match.group(5)])
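One detail of this pattern: group 2 ends right before `^Conditions`, so it includes the line break at the end of the title, and `replace('\n', ' ')` turns that into a trailing space (and a blank line inside a title into a double space). A possible alternative, sketched here with a hypothetical helper name, is to collapse every run of whitespace:

```python
def normalize_title(raw: str) -> str:
    """Collapse runs of whitespace (spaces, line breaks, blank lines) into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw.split())


# Example value shaped like group 2 for a title split by a blank line.
raw_title = "This is the title 1 that\n\nis split into two lines with a blank line in the middle\n"
clean_title = normalize_title(raw_title)
print(repr(clean_title))
```

In the loop above, `normalize_title(match.group(2))` could be used in place of the `replace` call.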