Regular expression to extract chunks of text from a text file?

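I need to use a regular expression in Python to extract headings and the chunks of text beneath them from a text file, but I'm finding it difficult.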

I converted this PDF to text, and it now looks like this:

So far I have been able to get all the numeric headings (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) with the following regular expression:

import re

with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        # Match a dotted section number at the start of the line, e.g. 13.1.1
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)

I just can't figure out how to get the worded part of these headings or the paragraphs of text beneath them.

EDIT - Here is the text:

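I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014

60601-1 © IEC:2005 60601-1 © IEC:2005

– 337 – – 169 –

12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

12.4.6 Diagnostic or therapeutic acoustic pressure When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the RISKS associated with diagnostic or therapeutic acoustic pressure.

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

13 * HAZARDOUS SITUATIONS and fault conditions

13.1 Specific HAZARDOUS SITUATIONS

13.1.1 When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the ME EQUIPMENT.

The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is described in 4.7.

13.1.2 The following HAZARDOUS SITUATIONS shall not occur: – emission of flames, molten metal, poisonous or ignitable substance in hazardous

quantities;

– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; –

temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when measured as described in 11.1.3; temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be touched, exceeding the allowable values in Table 23 when measured and adjusted as described in 11.1.3;

– exceeding the allowable values for “other components and materials” identified in Table 22 times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. In all other cases, the allowable values of Table 22 apply.

Temperatures shall be measured using the method described in 11.1.3.

The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of flames, molten metal or ignitable substances, shall not be applied to parts and components where: – The construction or the supply circuit limits the power dissipation in SINGLE FAULT

CONDITION to less than 15 W or the energy dissipation to less than 900 J.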

Maybe,

^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)

might get somewhat close to the text you want, I'm guessing.


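Here, we simply look for lines that start with

^(\d+(?:\.\d+)*)\s+

then we collect anything that comes afterwards using

([\s\S]*?)

up to the start of the next such line,

(?=^\d+(?:\.\d+)*)

Depending on what our input looks like, we may or may not be left with one last element, which we collect using:

^(\d+(?:\.\d+)*)\s+([\s\S]*)

and which we alternate (using |) with the previous expression.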

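Although this method is simple to code, it is rather slow performance-wise because of the lookaround, so another approach would be better if time complexity is a concern, which it most likely is.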

Test

import re

regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)"
string = """

I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014

60601-1 © IEC:2005 
60601-1 © IEC:2005

– 337 – 
– 169 –

12.4.5.4  Other ME EQUIPMENT producing diagnostic or therapeutic radiation 
When  applicable,  the  MANUFACTURER  shall  address  in  the  RISK  MANAGEMENT PROCESS  the 
RISKS associated  with  ME EQUIPMENT  producing  diagnostic or therapeutic radiation  other  than 
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). 

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

12.4.6  Diagnostic or therapeutic acoustic pressure 
When  applicable,  the  MANUFACTURER  shall  address  in  the  RISK  MANAGEMENT PROCESS  the 
RISKS associated with diagnostic or therapeutic acoustic pressure. 

Compliance is checked by inspection of the RISK MANAGEMENT FILE.

13  *  HAZARDOUS SITUATIONS and fault conditions

13.1  Specific HAZARDOUS SITUATIONS

*  General 

13.1.1 
When  applying  the  SINGLE  FAULT  CONDITIONS  as  described  in  4.7  and listed  in  13.2,  one  at  a 
time,  none  of  the  HAZARDOUS  SITUATIONS  in  13.1.2  to  13.1.4  (inclusive)  shall  occur  in  the 
ME EQUIPMENT.

The failure of any one component at a time, which could result in a  HAZARDOUS  SITUATION, is 
described in 4.7. 

*  Emissions, deformation of ENCLOSURE or exceeding maximum temperature 

13.1.2 
The following HAZARDOUS SITUATIONS shall not occur: 
–  emission  of  flames,  molten  metal,  poisonous  or  ignitable  substance  in  hazardous 

quantities; 

–  deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; 
– 

temperatures  of  APPLIED  PARTS exceeding  the  allowed  values  identified  in  Table  24  when 
measured as described in 11.1.3; 
temperatures  of  ME EQUIPMENT  parts  that  are  not  APPLIED  PARTS but  are  likely  to  be 
touched,  exceeding  the  allowable  values  in  Table  23  when  measured  and  adjusted  as 
described in 11.1.3; 

– 

–  exceeding the allowable values for “other components and materials” identified in Table 22 
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. 
In all other cases, the allowable values of Table 22 apply. 

Temperatures shall be measured using the method described in 11.1.3. 

The  SINGLE  FAULT  CONDITIONS  in  4.7,  8.1 b),  8.7.2  and  13.2.2,  with  regard  to  the  emission  of 
flames,  molten  metal  or  ignitable  substances,  shall  not  be  applied  to  parts  and  components 
where: 
–  The  construction  or  the  supply  circuit  limits  the  power  dissipation  in  SINGLE  FAULT 

CONDITION to less than 15 W or the energy dissipation to less than 900 J. 

"""

print(re.findall(regex, string, re.M))

Output

[('12.4.5.4', 'Other ME EQUIPMENT producing diagnostic or therapeutic radiation \nWhen applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the \nRISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than \nfor diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3). \n\nCompliance is checked by inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('12.4.6', 'Diagnostic or therapeutic acoustic pressure \nWhen applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the \nRISKS associated with diagnostic or therapeutic acoustic pressure. \n\nCompliance is checked by inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('13', '* HAZARDOUS SITUATIONS and fault conditions\n\n', '', ''), ('13.1', 'Specific HAZARDOUS SITUATIONS\n\n* General \n\n', '', ''), ('13.1.1', 'When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a \ntime, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the \nME EQUIPMENT.\n\nThe failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is \ndescribed in 4.7. \n\n* Emissions, deformation of ENCLOSURE or exceeding maximum temperature \n\n', '', ''), ('', '', '13.1.2', 'The following HAZARDOUS SITUATIONS shall not occur: \n– emission of flames, molten metal, poisonous or ignitable substance in hazardous \n\nquantities; \n\n– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired; \n– \n\ntemperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when \nmeasured as described in 11.1.3; \ntemperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be \ntouched, exceeding the allowable values in Table 23 when measured and adjusted as \ndescribed in 11.1.3; \n\n– \n\n– exceeding the allowable values for “other components and materials” identified in Table 22 \ntimes 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31. \nIn all other cases, the allowable values of Table 22 apply. \n\nTemperatures shall be measured using the method described in 11.1.3. \n\nThe SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of \nflames, molten metal or ignitable substances, shall not be applied to parts and components \nwhere: \n– The construction or the supply circuit limits the power dissipation in SINGLE FAULT \n\nCONDITION to less than 15 W or the energy dissipation to less than 900 J. \n\n')]
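For comparison with the performance remark, here is a minimal sketch of a lookaround-free alternative that scans the text line by line. The names `HEADING` and `split_sections` are hypothetical and not part of the answer above, and the heading test is an assumption: a line counts as a heading only if it consists of a dotted number, optionally followed by whitespace and text. Like the regex above, it would still mistake a body line such as "22 times 1,5 ..." for a heading.

```python
import re

# A dotted section number at the start of a line, optionally followed by text.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)(?:\s+(.*))?$")


def split_sections(text):
    """Return (number, body) pairs; body is everything up to the next heading."""
    sections = []
    number, body = None, []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the section collected so far.
            if number is not None:
                sections.append((number, "\n".join(body).strip()))
            number = match.group(1)
            body = [match.group(2) or ""]
        elif number is not None:
            # Lines before the first heading are ignored.
            body.append(line)
    if number is not None:
        sections.append((number, "\n".join(body).strip()))
    return sections


sample = """preamble
12.4.6  Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address the RISKS.

13  *  HAZARDOUS SITUATIONS and fault conditions

13.1.1
When applying the SINGLE FAULT CONDITIONS.
"""

for number, body in split_sections(sample):
    print(number, repr(body))
```

Each line is inspected once, so the running time grows linearly with the input, and the last section needs no special case.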

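You could use your pattern and match a space followed by the rest of the line.

Then repeat matching all following lines that do not start with a heading.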

^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
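  • ^\d+(?:\.\d+)* followed by a space: your pattern to match a heading, followed by a space
  • .* matches any character except a newline, 0+ times
  • (?: opens a non-capturing group
    • \r?\n matches a newline
    • (?! negative lookahead, asserts that what is directly to the right is not
      • \d+(?:\.\d+)* followed by a space: the heading pattern
    • ) closes the lookahead
    • .* matches any character except a newline, 0+ times
  • )* closes the non-capturing group and repeats it 0+ times to match all the lines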
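A small usage sketch of the pattern above, run with `re.M` so that `^` anchors at every line start. The sample text is a shortened, made-up excerpt, and the variable names are only illustrative. Note that the pattern requires a space after the number, so a heading number that sits alone on its line would not start a chunk.

```python
import re

pattern = re.compile(r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M)

sample = (
    "page header\n"
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS.\n"
    "\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "see 4.7 for details\n"
)

# There are no capturing groups, so findall returns each whole match as a string.
chunks = pattern.findall(sample)

for chunk in chunks:
    # The first line of a chunk is the heading, the remainder is its body.
    heading, _, body = chunk.partition("\n")
    print(repr(heading), "->", repr(body.strip()))
```

The line "see 4.7 for details" stays inside the second chunk because it does not begin with a number followed by a space.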

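Thanks to their detailed answers and helpful explanations, I ended up combining parts of @The-fourth-bird's code and @Emma's code into this regex, which seems to work well for what I need.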

(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)(?=^\d+(?:\.\d+)*\s+(?![a-z]))

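It does what I want, which is to separate the (numeric heading), the (worded heading) and the (body of text) into comma-separated groups, so that I can split them into Excel columns by using the custom delimiter ), ( and some other post-processing.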
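As a rough sketch of that post-processing step, the three groups can also be written out with Python's `csv` module, which Excel opens directly. This swaps in a different technique from the custom `), (` delimiter described above; the sample text and variable names are made up for illustration.

```python
import csv
import io
import re

pattern = re.compile(
    r"(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)"
    r"(?=^\d+(?:\.\d+)*\s+(?![a-z]))",
    re.M,
)

sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS.\n"
    "\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "Body text.\n"
    "\n"
    "14 Closing heading\n"
)

# Each match is a (number, title, body) tuple; trim the surrounding whitespace.
rows = [tuple(part.strip() for part in groups) for groups in pattern.findall(sample)]

# Write to an in-memory buffer here; a real script would open a .csv file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because the pattern ends with a lookahead for the next heading, the final section ("14 Closing heading" in this sample) is not captured; appending a sentinel heading to the text is one possible workaround.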

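The benefit of this new regex is that it skips numbered lines that are only cross-references rather than actual headings, that is, lines where the number is followed by lowercase text.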
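The skipping behaviour can be checked with a short sketch. The sample below is invented: its third line begins with a section number but continues in lowercase, so the `(?![a-z])` guard should keep it inside the body of 13.1. This assumes a single space after the number.

```python
import re

pattern = re.compile(
    r"(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)"
    r"(?=^\d+(?:\.\d+)*\s+(?![a-z]))",
    re.M,
)

sample = (
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "The situations are listed in\n"
    "13.1.2 to 13.1.4 (inclusive).\n"
    "\n"
    "13.2 SINGLE FAULT CONDITIONS\n"
    "Body of 13.2.\n"
    "\n"
    "14 Next heading\n"
)

matches = pattern.findall(sample)
numbers = [number.strip() for number, _title, _body in matches]
print(numbers)  # ['13.1', '13.2']
```

The line starting with "13.1.2 to" ends up in the third group of the first match instead of opening a section of its own.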

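import re

import pdfplumber

pdf_to_string = ""

with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # Extract once per page; fall back to "" if a page yields no text
        text = page.extract_text() or ""
        print(text)
        # Keep a newline between pages so a heading at the top of a page still starts a line
        pdf_to_string += text + "\n"

# A heading line plus every following line that does not start a new heading
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdf_to_string, re.M)
for match in matches:
    # Keep only the chunks whose first 50 characters mention the keyword
    if "word_to_extract" in match[:50]:
        print(match)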

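This solution extracts every heading that has the same format as the ones in the question, then picks out the desired heading together with the paragraph that follows it.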
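A variation on the keyword filter, sketched on plain text so it runs without a PDF. The function name `sections_with_keyword` and the sample are hypothetical. Instead of looking at the first 50 characters, it checks only the heading line, which avoids matching a keyword that appears in the body right after a short heading.

```python
import re

SECTION = re.compile(r"^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*", re.M)


def sections_with_keyword(text, keyword):
    """Return the chunks whose heading line contains keyword (case-insensitive)."""
    wanted = []
    for chunk in SECTION.findall(text):
        heading = chunk.split("\n", 1)[0]
        if keyword.lower() in heading.lower():
            wanted.append(chunk.strip())
    return wanted


sample = (
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "Compliance is checked by inspection of the RISK MANAGEMENT FILE.\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "See the acoustic pressure clause.\n"
)

hits = sections_with_keyword(sample, "acoustic")
for hit in hits:
    print(hit)
```

Only the 12.4.6 section is returned here, even though the word also occurs in the body of 13.1.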