How to section text that has been extracted?

Question

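I am trying to find a way to split text that I have already extracted into two variables. I am working with scientific texts, and I want to separate the abstract from the rest of the article (the introduction through the conclusion), so that one variable holds the abstract and the other holds everything else.

How would I go about this? I have tried regular expressions but could not get them to work. Below is some of the code I have used.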

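import pdfplumber

with pdfplumber.open("") as pdf:
    all_text = ''
    for pdf_page in pdf.pages:
        # Fall back to '' in case extract_text() yields nothing for a page
        single_page_text = pdf_page.extract_text() or ''
        # print(single_page_text)

        # Append each page's text to the running total
        all_text = all_text + '\n' + single_page_text
    # print(all_text)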

Answer 1

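I am assuming that the abstract is enclosed by the strings "Abstract" and "*Correspondence". I use str.split() to create a list containing the text before and after "Abstract". I then split the second element of that list, which gives a list containing the text before "*Correspondence" and the text after "*Correspondence". The first element of this second list is the abstract. Everything except the abstract is appended to another variable. Since the abstract is on the first page, this is only applied to the first page, which is selected using enumerate.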

import pdfplumber

with pdfplumber.open("s12865-020-00390-9.pdf") as pdf:
    text_without_abstract = ''
    abstract = ''
    for index, pdf_page in enumerate(pdf.pages):
        if index == 0:
            # Only the first page is assumed to contain the abstract
            single_page_text = pdf_page.extract_text()
            # [0] is the text before "Abstract", [1] the text after it
            split_at_abstract = single_page_text.split("Abstract")
            text_without_abstract += split_at_abstract[0]
            # [0] is the abstract, [1] the text after "*Correspondence"
            split_at_asterisk_correspondence = split_at_abstract[1].split("*Correspondence")
            abstract = split_at_asterisk_correspondence[0]
            text_without_abstract += split_at_asterisk_correspondence[1]
        else:
            # All other pages belong to the rest of the article
            text_without_abstract += pdf_page.extract_text()

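Note: this approach depends heavily on the string content of the document. It will not work if the string "Abstract" appears inside the abstract, or if the first marker string after the abstract is not "*Correspondence".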

str.split() : https://docs.python.org/3.8/library/stdtypes.html#str.split
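
The marker-based split described above can also be tried on plain strings, without a PDF at hand. The following is only a minimal sketch: the helper name split_abstract and its parameters are hypothetical, and it swaps str.split() for str.partition(), which cuts at the first occurrence of a marker only. It returns None for the abstract when either marker is missing instead of raising an IndexError.

```python
def split_abstract(page_text, start_marker="Abstract", end_marker="*Correspondence"):
    """Return (abstract, rest); abstract is None if either marker is missing.

    Hypothetical helper, not part of pdfplumber. The default markers are the
    ones assumed in the answer above and will differ for other documents.
    """
    # partition() always returns a 3-tuple; the middle item is '' if not found
    before, start_found, after_start = page_text.partition(start_marker)
    if not start_found:
        return None, page_text
    abstract, end_found, after_end = after_start.partition(end_marker)
    if not end_found:
        return None, page_text
    # Both markers are dropped, as in the str.split() version
    return abstract.strip(), before + after_end


sample = "Title\nAbstract\nWe study X.\n*Correspondence: a@b.org\nBackground"
abstract, rest = split_abstract(sample)
```

With pdfplumber, the helper would be called on the text of the first page only, and the text of the remaining pages appended to rest unchanged.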

python

pdf

text-extraction

data-mining