Regular expression to extract chunks of text from a text file?
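I need to use a regular expression in Python to extract headings and the chunks of text beneath them from a text file, but I'm finding it difficult.
I converted this PDF to text, and now it looks like this:
So far I have been able to get all the numeric headers (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) using the following regex: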
import re

with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        # match a dotted section number at the start of the line
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)
I just can't figure out how to get the worded part of these headings or the paragraphs of text beneath them.
EDIT - Here's the text:
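I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
* General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
* Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.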
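Maybe,
^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)
might come somewhat close to the text you want, I'm guessing.
Here, we first look for lines that start with a section number (with the re.M flag, so that ^ matches at the start of every line):
^(\d+(?:\.\d+)*)\s+
then we lazily collect everything after it with
([\s\S]*?)
up to the start of the next numbered line, which the lookahead detects:
(?=^\d+(?:\.\d+)*)
Depending on what the input looks like, that may or may not leave one last element with no numbered line after it, which we collect with:
^(\d+(?:\.\d+)*)\s+([\s\S]*)
and then join to the previous expression with an alternation (|). A match from this second alternative fills groups 3 and 4 instead of groups 1 and 2.
This method is simple to code, but it is fairly slow, since the lookahead is retried at every character the lazy quantifier consumes; if time complexity is a concern, which it likely is, another approach would be better.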
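If the lookahead cost matters, one alternative (not part of the answer above, only a sketch) is to scan the text line by line and start a new section whenever a line looks like a heading. The `HEADING` pattern and the `split_sections` helper are hypothetical names introduced for illustration. The sketch assumes a heading is a dotted number either alone on its line or followed by whitespace and a title, so a wrapped body line that happens to begin with a section number would be misread as a heading.

```python
import re

# Heading line: a dotted section number, optionally followed by a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)(?:\s+(.*))?$")

def split_sections(text):
    """Return a list of (number, title, body) tuples, scanning line by line."""
    sections = []
    current = None  # [number, title, body_lines] for the section being built
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            if current:
                sections.append((current[0], current[1], "\n".join(current[2])))
            current = [match.group(1), match.group(2) or "", []]
        elif current:
            # Lines before the first heading (page headers etc.) are skipped.
            current[2].append(line)
    if current:
        sections.append((current[0], current[1], "\n".join(current[2])))
    return sections

sample = """page header
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address the RISKS.
13.1.1
When applying the SINGLE FAULT CONDITIONS, one at a time.
"""

for number, title, body in split_sections(sample):
    print(number, "|", title, "|", body)
```

Each line is examined once, so the running time grows linearly with the input, and the last section needs no special case.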
Demo 1
Test
import re
regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)"
string = """
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
* General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
* Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
"""
print(re.findall(regex, string, re.M))
Output
[('12.4.5.4', 'Other ME EQUIPMENT producing diagnostic or therapeutic
radiation \nWhen applicable, the MANUFACTURER shall address in
the RISK MANAGEMENT PROCESS the \nRISKS associated with ME
EQUIPMENT producing diagnostic or therapeutic radiation other than
\nfor diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
\n\nCompliance is checked by inspection of the RISK MANAGEMENT
FILE.\n\n', '', ''), ('12.4.6', 'Diagnostic or therapeutic acoustic
pressure \nWhen applicable, the MANUFACTURER shall address in
the RISK MANAGEMENT PROCESS the \nRISKS associated with diagnostic
or therapeutic acoustic pressure. \n\nCompliance is checked by
inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('13', '*
HAZARDOUS SITUATIONS and fault conditions\n\n', '', ''), ('13.1',
'Specific HAZARDOUS SITUATIONS\n\n* General \n\n', '', ''),
('13.1.1', 'When applying the SINGLE FAULT CONDITIONS as
described in 4.7 and listed in 13.2, one at a \ntime, none
of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive)
shall occur in the \nME EQUIPMENT.\n\nThe failure of any one
component at a time, which could result in a HAZARDOUS SITUATION, is
\ndescribed in 4.7. \n\n* Emissions, deformation of ENCLOSURE or
exceeding maximum temperature \n\n', '', ''), ('', '', '13.1.2', 'The
following HAZARDOUS SITUATIONS shall not occur: \n– emission of
flames, molten metal, poisonous or ignitable substance in
hazardous \n\nquantities; \n\n– deformation of ENCLOSURES to such an
extent that compliance with 15.3.1 is impaired; \n– \n\ntemperatures
of APPLIED PARTS exceeding the allowed values identified in
Table 24 when \nmeasured as described in 11.1.3; \ntemperatures of
ME EQUIPMENT parts that are not APPLIED PARTS but are likely
to be \ntouched, exceeding the allowable values in Table 23
when measured and adjusted as \ndescribed in 11.1.3; \n\n– \n\n–
exceeding the allowable values for “other components and materials”
identified in Table 22 \ntimes 1,5 minus 12,5 °C. Limits for windings
are found in Table 26, Table 27 and Table 31. \nIn all other cases,
the allowable values of Table 22 apply. \n\nTemperatures shall be
measured using the method described in 11.1.3. \n\nThe SINGLE FAULT
CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to
the emission of \nflames, molten metal or ignitable substances,
shall not be applied to parts and components \nwhere: \n– The
construction or the supply circuit limits the power
dissipation in SINGLE FAULT \n\nCONDITION to less than 15 W or the
energy dissipation to less than 900 J. \n\n')]
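You could use your pattern and match a space followed by the rest of the line.
Then repeat matching all the following lines that do not start with a heading.
^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
^\d+(?:\.\d+)*    Match the heading pattern, followed by a space
.*    Match any character except a newline, 0+ times
(?:    Non-capturing group
\r?\n    Match a newline
(?!    Negative lookahead, assert that what is directly to the right is not
\d+(?:\.\d+)*    the heading pattern, followed by a space
)    Close the lookahead
.*    Match any character except a newline, 0+ times
)*    Close the non-capturing group and repeat it 0+ times to match all the lines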
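As a rough illustration of how this pattern behaves, the sketch below runs it on a shortened, made-up excerpt of the question's text. The `sample` string and the variable names are assumptions for the example. Note that the pattern requires a space after the number, so a bare number alone on its line (such as 13.1.1 in the question) is only treated as a heading if a trailing space follows it.

```python
import re

# re.M makes ^ match at the start of every line, not only at the start of the text.
pattern = re.compile(r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M)

sample = """60601-1 (c) IEC:2005
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address the RISKS.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
* General
"""

# The pattern has no capturing groups, so findall returns whole matches.
blocks = pattern.findall(sample)
for block in blocks:
    # The first line is the heading; everything after it is the body.
    heading, _, body = block.partition("\n")
    print(repr(heading), "->", repr(body.strip()))
```

The page header line is skipped because "60601" is followed by a hyphen rather than a space.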
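Thanks to their detailed answers and helpful explanations, I ended up combining parts of @The-fourth-bird's code and @Emma's code into this regex, which seems to work well for what I need.
(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)(?=^\d+(?:\.\d+)*\s+(?![a-z]))
Here is the REGEX DEMO.
It does what I want, which is to separate the (numeric heading), the (worded heading) and the (body of text) into comma-separated groups, so that I can split them into Excel columns using the custom delimiter ), ( and some other post-processing.
The nice thing about this new regex is that it skips numbered lines that are merely references rather than actual headings: the negative lookahead (?![a-z]) rejects a number that is followed by lowercase text.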
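A small sketch of the combined regex at work, writing the three groups as CSV rows that a spreadsheet can open. The `sample` text is made up for the example, and using the standard `csv` module is an assumption about the post-processing rather than what was actually done. One caveat the sketch makes visible: the trailing lookahead needs a following heading, so the final section of the text ("14 End marker" here) is not matched.

```python
import csv
import io
import re

pattern = re.compile(
    r"(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)"
    r"(?=^\d+(?:\.\d+)*\s+(?![a-z]))",
    re.M,
)

sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS\n"
    "13.1.2 to 13.1.4 are cross-references, not headings.\n"
    "13 * HAZARDOUS SITUATIONS and fault conditions\n"
    "Body of section 13.\n"
    "14 End marker\n"
)

# One row per match: (number, title, body), with surrounding whitespace removed.
rows = [tuple(part.strip() for part in m.groups()) for m in pattern.finditer(sample)]

# Write to an in-memory buffer; swap in open("out.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["number", "title", "body"])
writer.writerows(rows)
print(buffer.getvalue())
```

The line starting with "13.1.2 to" stays inside the body of section 12.4.6, because the number is followed by a lowercase word.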
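import re

import pdfplumber

pdfToString = ""
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # guard against a page with no extractable text
        page_text = page.extract_text() or ""
        print(page_text)
        # keep a newline between pages so that the last line of one page
        # is not glued to the first line of the next
        pdfToString += page_text + "\n"

# re.M lets ^ match at the start of every line
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # only the first 50 characters (roughly the heading) are searched
    if "word_to_extract" in i[:50]:
        print(i)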
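This solution extracts all the headings that have the same format as in the question, then prints the desired heading together with the paragraphs that follow it.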
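The filtering step can also be separated from the PDF extraction so that it can be tried on plain strings. The helper below is a hypothetical variation, not the code above: `sections_with_keyword` is an invented name, and it checks the heading line itself (case-insensitively) rather than the first 50 characters.

```python
import re

SECTION = re.compile(r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M)

def sections_with_keyword(text, keyword):
    """Return the sections whose heading line contains keyword (case-insensitive)."""
    found = []
    for section in SECTION.findall(text):
        # The heading is the first line of the matched block.
        heading = section.split("\n", 1)[0]
        if keyword.lower() in heading.lower():
            found.append(section)
    return found

text = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "Compliance is checked by inspection of the RISK MANAGEMENT FILE.\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "* General\n"
)

print(sections_with_keyword(text, "acoustic"))
```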