Extracting a substring from PDF raw text using regex

I am trying to extract the subsections that contain a Roman-numeral list from a PDF document.

For example, this is part of the document:

\n1.1\n \nSCOPE\n \nThis PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure'

What I want is only the following:

This PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure

I need the sentence that precedes the Roman-numeral list as well as the content of the Roman-numeral items.

However, there are also cases like the following:

3.1.3\n \nDo\nc\numentation\n \nrequired\n \nT\nh\ne\n \nl\nat\ne\ns\nt\n \nissue\n \nof\n \nt\nh\ne\n \nf\no\nllo\nw\ni\nng\n \ndocume\nn\nts\n \nshall\n \nbe\n \nav\na\nilab\nl\ne\n \nto\n \nthe\n \nte\na\nm\n \np\ne\nrf\no\nrm\ni\nng\n \nt\nh\ne \nc\nl\nass\ni\nf\ni\ncati\no\nn:\n \ni.\n \nMandatory reference document\n \na)\n \nCause and effect matrices (CEM)\n \nb)\n \nPiping and Instrument Diagram (P&ID) or Process and utility engineering \nflow schemes (PEFS)\n \nc)\n \nHAZOP report\n \nd)\n \nIPF reliability data\n \nii.\n \nOther reference document\n \na)\n \nProcess Flow Diagram (PFD) or Process Fl\now Scheme (PFS)\n \nb)\n \nPlant layout drawing\n \nc)\n \nProcess safeguarding flow schemes (PSFS)\n \nd)\n \nControl narratives\n \ne)\n \nInterlocks/ ESD logic diagram\n \nf)\n \nEquipment layout diagram\n \ng)\n \nMaintenance and Inspection Data\n \nh)\n \nPlant historian data\n \n \nT\nh\ne\n \nl\ni\ns\nt\n \na\nb\no\nve\n \nis\n \nn\no\nt\n \ne\nx\nh\na\nu\nsti\nv\ne. Any\n \not\nh\ne\nr\n \ndo\nc\nu\nm\ne\nn\nt\ns\n/ \nd\nr\na\nw\nin\ng\ns\n \nreq\nu\nir\ne\nd\n \nf\no\nr\n \nt\nhe \nc\nom\np\nletion\n \no\nf the\n \nIPF\n \ns\nt\nu\nd\ny\n \ns\nh\na\nll\n \nbe\n \nf\nu\nr\nn\nished\n \nas\n \na\nn\nd\n \nw\nhen\n \nre\nq\nui\nr\ne\nd\n.\n \n

I have converted the PDF to raw text and have managed to extract part of the document.

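import re

# txt holds the raw text extracted from the PDF; indexhead is a list that collects the matches
regx = re.compile(r'\.\n \n.+?:\n \n', re.DOTALL)
find = str(txt)
indexhead.append(regx.findall(find))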

The code above extracts only the heading sentence, not the Roman-numeral items:

.\n \nThe scope includes the following:\n \n

I am trying to extract based on a pattern, but I think some conditional rules might help.

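If I understand the problem correctly, we only want to pick out the sections with Roman-numeral items and get the whole paragraph. We can start with a simple expression such as: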

.+[0-9]\.?.+?([A-Z][a-z].*)

Then, as new cases come up, we would simply use logical OR (|) to add further rules.
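To illustrate what adding a rule with logical OR could look like, here is a minimal sketch. The MARKER pattern and the shortened sample are my own assumptions for illustration, not part of the expression above: one alternative covers a Roman numeral followed by ".", the other a letter followed by ")".

```python
import re

# Hypothetical marker pattern built with "|":
# a Roman numeral plus "."  OR  a single letter plus ")", alone on a line.
MARKER = re.compile(r"^(?:[ivxlc]+\.|[a-z]\))$", re.MULTILINE)

# Shortened sample in the same shape as the raw PDF text from the question.
sample = (
    "following:\n \ni.\n \nMandatory reference document\n \n"
    "a)\n \nCause and effect matrices (CEM)\n \nii.\n \nOther reference document\n"
)

markers = MARKER.findall(sample)
print(markers)  # ['i.', 'a)', 'ii.']
```

Each new case then becomes one more alternative inside the group, which keeps the individual rules easy to read and to test.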

Demo

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r".+[0-9]\.?.+?([A-Z][a-z].*)"

test_str = ("\n1.1\n \nSCOPE\n \nThis PTS specifies the\n \nrequirements \nand recommendations for Classification, Verification \n\nFunct\nions.\n \nThe scope includes the following:\n \ni.\n \nSemi\n-\nquantitative SIL classification\n \nii.\n \nSpurious trip analysis\n \niii.\n \nProbabilistic and architectural SIL verification\n \niv.\n \nRecommendations\n \nfor SIL gap closure'\n\n"
    "3.1.3\n \nDo\nc\numentation\n \nrequired\n \nT\nh\ne\n \nl\nat\ne\ns\nt\n \nissue\n \nof\n \nt\nh\ne\n \nf\no\nllo\nw\ni\nng\n \ndocume\nn\nts\n \nshall\n \nbe\n \nav\na\nilab\nl\ne\n \nto\n \nthe\n \nte\na\nm\n \np\ne\nrf\no\nrm\ni\nng\n \nt\nh\ne \nc\nl\nass\ni\nf\ni\ncati\no\nn:\n \ni.\n \nMandatory reference document\n \na)\n \nCause and effect matrices (CEM)\n \nb)\n \nPiping and Instrument Diagram (P&ID) or Process and utility engineering \nflow schemes (PEFS)\n \nc)\n \nHAZOP report\n \nd)\n \nIPF reliability data\n \nii.\n \nOther reference document\n \na)\n \nProcess Flow Diagram (PFD) or Process Fl\now Scheme (PFS)\n \nb)\n \nPlant layout drawing\n \nc)\n \nProcess safeguarding flow schemes (PSFS)\n \nd)\n \nControl narratives\n \ne)\n \nInterlocks/ ESD logic diagram\n \nf)\n \nEquipment layout diagram\n \ng)\n \nMaintenance and Inspection Data\n \nh)\n \nPlant historian data\n \n \nT\nh\ne\n \nl\ni\ns\nt\n \na\nb\no\nve\n \nis\n \nn\no\nt\n \ne\nx\nh\na\nu\nsti\nv\ne. Any\n \not\nh\ne\nr\n \ndo\nc\nu\nm\ne\nn\nt\ns\n/ \nd\nr\na\nw\nin\ng\ns\n \nreq\nu\nir\ne\nd\n \nf\no\nr\n \nt\nhe \nc\nom\np\nletion\n \no\nf the\n \nIPF\n \ns\nt\nu\nd\ny\n \ns\nh\na\nll\n \nbe\n \nf\nu\nr\nn\nished\n \nas\n \na\nn\nd\n \nw\nhen\n \nre\nq\nui\nr\ne\nd\n.\n \n")

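# "\\1" is a backreference to group 1; a plain "\1" would be the control character chr(1)
subst = "\\1"

# You can manually specify the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)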

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
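One thing worth checking when running this on text that contains real newline characters: without re.DOTALL, "." does not match a newline, so the expression has to fit on a single line. The sketch below compares both behaviours on a shortened single-section sample (the sample and variable names are assumptions for illustration). Because the leading ".+" is greedy, it assumes one section per string; on a text with several sections it would skip ahead to the last digit.

```python
import re

pattern = r".+[0-9]\.?.+?([A-Z][a-z].*)"

# A shortened, single-section sample containing real newline characters.
sample = "\n1.1\n \nSCOPE\n \nThis PTS specifies the\n \nrequirements"

# Default flags: "." stops at each newline, so no single line satisfies the pattern.
no_flag = re.search(pattern, sample)

# re.DOTALL lets "." cross newlines, so group 1 can span the rest of the section.
with_dotall = re.search(pattern, sample, flags=re.DOTALL)

print(no_flag)                     # None
print(repr(with_dotall.group(1)))  # 'This PTS specifies the\n \nrequirements'
```

Using re.search with group(1) instead of re.sub also avoids having to write a replacement template at all.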

RegEx

If this expression is not what you need, it can be modified or changed on regex101.com.

RegEx Circuit

jex.im visualizes regular expressions.

After some exploring, the following is the closest I have come to what I want to achieve:

regx = re.compile(r': \ni(?:(?!\n[A-Z]).).*?\.\n\d\.|:\ni(?:(?!\n[A-Z]).).*?\.\n\d\.', re.DOTALL)
find = str(cleanSectionContent2[req])

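It detects the cases that start with ":" followed by a line beginning with "i" and that end at the next header, matched by '\n\d.'. It does not detect every case, though, so I will update this post with further solutions.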
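As a possible alternative to one large pattern, the text could first be cut into sections at the numbered headers, and only the sections that contain a Roman-numeral item kept. This is a minimal sketch under two assumptions of mine: a header is a line holding only a dotted number such as "1.1" or "3.1.3", and a Roman item is a line holding only "i.", "ii.", "iv." and so on. The function name and sample text are hypothetical.

```python
import re

# Assumption: a section starts with a line holding only a dotted number, e.g. "1.1" or "3.1.3".
SECTION_START = re.compile(r"^\d+(?:\.\d+)+$", re.MULTILINE)
# Assumption: a Roman-numeral item is a line holding only "i.", "ii.", "iv.", ...
ROMAN_ITEM = re.compile(r"^[ivx]+\.$", re.MULTILINE)


def sections_with_roman_lists(text):
    """Return the sections (header included) that contain a Roman-numeral item."""
    starts = [m.start() for m in SECTION_START.finditer(text)]
    bounds = starts + [len(text)]
    # Text before the first header is ignored; each slice runs up to the next header.
    sections = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return [s for s in sections if ROMAN_ITEM.search(s)]


sample = (
    "\n1.1\n \nSCOPE\n \nThe scope includes the following:\n \n"
    "i.\n \nSpurious trip analysis\n"
    "2.1\n \nGENERAL\n \nNo list here.\n"
)

found = sections_with_roman_lists(sample)
print(len(found))
print(repr(found[0]))
```

The returned sections still include the header number and title, so removing those would need a further step.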