Python: extract paragraphs from a text file using regex
I'm using Python 3.7 and trying to extract paragraphs from some text files using regular expressions.
Here is a sample of the text file's contents:
AREA: OMBEYI MARKET, ST. RITA RAMULA
DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.
AREA: NYAMACHE FACTORY
DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
AREA: SUNEKA MARKET, RIANA MARKET
DATE: Thursday 25.03.2021, TIME: 8.00 A.M. - 3.00 P.M.
Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.
AREA: ITIATI, GITUNDUTI
DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 2.00 P.M.
General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.
Currently I can extract the area, date, and time using these regexes:
area_pattern = re.compile("^AREA:((.*))")
date_pattern = re.compile("^DATE:(.*),")
time_pattern = re.compile("TIME:(.*).")
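I would like to extract the paragraph that comes after the DATE/TIME line and before the next AREA line, which contains the comma-separated locations. So I would be able to match the following: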
1.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.
2.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
3.
Suneka Mk, Riana Mk, Kiabusura, Gesonso, Chisaro, Sugunana, Nyamira Ndogo & adjacent
customers.
4.
General China, Gachuiro, Gathuini Pri, Itiati Campus, Kianjugum, Gikore, Kihuri TBC, Gitunduti &
adjacent customers.
I would appreciate it if anyone could suggest a regex for this use case, as well as improvements to my current regexes. Thanks.
You can use this regex, which has one capture group, with re.findall:
\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)
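Regex details:
\nDATE:
  Matches a line break followed by the text DATE:
.*\n*
  Matches the rest of the line, followed by 0 or more line breaks
((?:\n.*)+?)
  Capture group 1, which captures our text: 1 or more lines, as few as possible, until the next condition is satisfied
(?=\nAREA:|\Z)
  Asserts that a line break followed by AREA:, or the end of input, is immediately ahead of the current position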
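As a minimal sketch of how the pattern above might be used with re.findall, the snippet below runs it on the first two records from the question's sample. The variable names (sample, pattern, paragraphs) are illustrative, and collapsing each wrapped paragraph onto a single line is an assumption about the desired output, not something the question requires.

```python
import re

# First two records from the question's sample, kept verbatim.
sample = """AREA: OMBEYI MARKET, ST. RITA RAMULA
DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.
AREA: NYAMACHE FACTORY
DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
"""

# No flags are needed: the pattern anchors on literal "\n" characters, not on ^.
pattern = re.compile(r"\nDATE:.*\n*((?:\n.*)+?)(?=\nAREA:|\Z)")

# With a single capture group, findall returns a list of strings.
# Each capture starts with "\n" and may span several lines, so the
# whitespace is normalised here (an assumption about the wanted output).
paragraphs = [" ".join(block.split()) for block in pattern.findall(sample)]

for number, paragraph in enumerate(paragraphs, start=1):
    print(number, paragraph)
```

Because the captured text begins with the line break that precedes the paragraph, some cleanup such as strip() or the split/join above is usually wanted before further processing.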
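As an alternative pattern (compile it with re.MULTILINE so that ^ matches at the start of every line, not only at the start of the string):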
^DATE:.*((?:\n(?!AREA:).*)+)
^DATE:.*
  Matches DATE: and the rest of the line
(
  Opens capture group 1
(?:\n(?!AREA:).*)+
  Repeats 1 or more lines that do not start with AREA:
)
  Closes group 1
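The question also asks about improving the existing area, date, and time patterns. One possible approach, sketched below, is to fold them into a single pattern that reuses the negative-lookahead idea from the alternative above. The group names (area, date, time, locations) and the variable names are hypothetical, and the sketch assumes every record has exactly the AREA line, then the DATE/TIME line, then the paragraph, as in the sample.

```python
import re

# First two records from the question's sample, kept verbatim.
sample = """AREA: OMBEYI MARKET, ST. RITA RAMULA
DATE: Thursday 25.03.2021, TIME: 9.00 A.M. = 5.00 P.M.
Ombeyi Mk, Kiliti Mkt, Masogo Mkt, Miwani, Kasongo, Onyango Midika, St. Rita Ramula, Onyalo
Biro, Yawo Pri, Obino, Rutek, Keyo Pri & adjacent customers.
AREA: NYAMACHE FACTORY
DATE: Thursday 25.03.2021, TIME: 830 A.M. - 3.00 P.M.
Nyamache Fact, Suguta, Gionseri, Igare, Kionduso, Nyationgongo, Enchoro, Kebuko, Emenwa, Maji
Mazuri, Borangi & adjacent customers.
"""

record_pattern = re.compile(
    r"^AREA:[ \t]*(?P<area>.*)\n"                # AREA line
    r"DATE:[ \t]*(?P<date>[^,\n]*),[ \t]*"       # date runs up to the first comma
    r"TIME:[ \t]*(?P<time>.*)"                   # time is the rest of that line
    r"(?P<locations>(?:\n(?!AREA:).*)+)",        # following lines not starting with AREA:
    re.MULTILINE,                                # required so ^ matches at each line start
)

# Collapse wrapped lines and stray whitespace in every captured field.
records = [
    {name: " ".join(value.split()) for name, value in match.groupdict().items()}
    for match in record_pattern.finditer(sample)
]

for record in records:
    print(record)
```

Compared with the original patterns, this version states its line anchoring explicitly through re.MULTILINE and avoids the unescaped trailing "." in the TIME pattern, which matches any character rather than a literal period.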