将文档部分分成列表以供导出 Python
Break document sections into list for export Python
我是 Python 的新手,我正在尝试将一些法律文件分成多个部分以导出到 SQL。我需要做两件事:
- 根据内容table定义节号,
- 根据定义的节号拆分文档
目录table列出节号:1.1、1.2、1.3等
然后文档本身被这些节号分解:
1.1 "...文本...",
1.2 "...文本...",
1.3 "...文本..."等
类似于一本书的章节,但由升序十进制数分隔。
我使用 Tika 解析了文档,并且我已经能够使用一些基本的正则表达式创建部分列表:
import tika
import re
from tika import parser
parsed = parser.from_file('test.pdf')
content = (parsed["content"])
headers = re.findall("[0-9]*[.][0-9]",content)
现在我需要做这样的事情:
splitsections = content.split() by headers
var_string = ', '.join('?' * len(splitsections))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, splitsections)
抱歉,如果所有这些都不清楚。对此还是很陌生。
如能提供任何帮助,我们将不胜感激。
除了最后一部分与 DB 之外的所有内容都经过了测试。代码也可以改进,但这是另一项任务。主线任务完成。
在列表 split_content
中有您想要的所有信息(即 2.1 和 2.2 之间的文本,然后是 2.2 和 2.3,依此类推,不包括 num+sections 本身的名称(即不包括 2.1 Continuation
、2.2 Name
等等)。
我用 PyPDF2
替换了 tika
,因为 tika
不提供此任务所需的工具(即我没有找到如何提供我需要的页数并获取它的内容)。
def get_pdf_content(pdf_path,
start_page_table_contents, end_page_table_contents,
first_parsing_page, last_phrase_to_stop):
"""
:param pdf_path: Full path to the PDF file
:param start_page_table_contents: The page where the "Contents table" starts
:param end_page_table_contents: The page where the "Contents Table" ends
(i.e. the number of the page where Contents Table ENDs, i.e. not the next one)
:param first_parsing_page: The 1st page where we need to start data grabbing
:param last_phrase_to_stop: The phrase that tells the code where to stop grabbing.
The phrase must match exactly what is written in PDF.
This phrase will be excluded from the grabbed data.
:return:
"""
# ======== GRAB TABLE OF CONTENTS ========
start_page = start_page_table_contents
end_page = end_page_table_contents
table_of_contents_page_nums = range(start_page-1, end_page)
sections_of_articles = [] # ['2.1 Continuation', '2.2 Name', ... ]
open_file = open(pdf_path, "rb")
pdf = PyPDF2.PdfFileReader(open_file)
for page_num in table_of_contents_page_nums:
page_content = pdf.getPage(page_num).extractText()
page_sections = re.findall("[\d]+[.][\d][™\s\w;,-]+", page_content)
for section in page_sections:
cleared_section = section.replace('\n', '').strip()
sections_of_articles.append(cleared_section)
# ======== GRAB ALL NECESSARY CONTENT (MERGE ALL PAGES) ========
total_num_pages = pdf.getNumPages()
parsing_pages = range(first_parsing_page-1, total_num_pages)
full_parsing_content = '' # Merged pages
for parsing_page in parsing_pages:
page_content = pdf.getPage(parsing_page).extractText()
cleared_page = page_content.replace('\n', '')
# Remove page num from the start of "page_content"
# Covers the case with the page 65, 71 and others when the "page_content" starts
# with, for example, "616.6 Liability to Partners. (a) It is understood that"
# i.e. "61" is the page num and "6.6 Liability ..." is the section data
already_cleared = False
first_50_chars = cleared_page[:51]
for section in sections_of_articles:
if section in first_50_chars:
indx = cleared_page.index(section)
cleared_page = cleared_page[indx:]
already_cleared = True
break
# Covers all other cases
if not already_cleared:
page_num_to_remove = re.match(r'^\d+', cleared_page)
if page_num_to_remove:
cleared_page = cleared_page[len(str(page_num_to_remove.group(0))):]
full_parsing_content += cleared_page
# ======== BREAK ALL CONTENT INTO PIECES ACCORDING TO TABLE CONTENTS ========
split_content = []
num_sections = len(sections_of_articles)
for num_section in range(num_sections):
start = sections_of_articles[num_section]
# Get the last piece, i.e. "11.16 FATCA" (as there is no any "end" section after "11.16 FATCA", so we cant use
# the logic like "grab info between sections 11.1 and 11.2, 11.2 and 11.3 and so on")
if num_section == num_sections-1:
end = last_phrase_to_stop
else:
end = sections_of_articles[num_section + 1]
content = re.search('%s(.*)%s' % (start, end), full_parsing_content).group(1)
cleared_piece = content.replace('™', "'").strip()
if cleared_piece[0:3] == '. ':
cleared_piece = cleared_piece[3:]
# There are few appearances of "[Signature Page Follows]", as a "last_phrase_to_stop".
# We need the text between "11.16 FATCA" and the 1st appearance of "[Signature Page Follows]"
try:
indx = cleared_piece.index(end)
cleared_piece = cleared_piece[:indx]
except ValueError:
pass
split_content.append(cleared_piece)
# ======== INSERT TO DB ========
# Did not test this section
for piece in split_content:
var_string = ', '.join('?' * len(piece))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, parts)
使用方法:(其中一种可能的方式):
1) 将上面的代码保存在my_pdf_code.py
2) 在 python shell:
import path.to.my_pdf_code as the_code
the_code.get_pdf_content('/home/username/Apollo_Investment_Fund_VIII_LPA_S1.pdf', 2, 4, 24, '[Signature Page Follows]')
我是 Python 的新手,我正在尝试将一些法律文件分成多个部分以导出到 SQL。我需要做两件事:
- 根据内容table定义节号,
- 根据定义的节号拆分文档
目录table列出节号:1.1、1.2、1.3等
然后文档本身被这些节号分解: 1.1 "...文本...", 1.2 "...文本...", 1.3 "...文本..."等
类似于一本书的章节,但由升序十进制数分隔。
我使用 Tika 解析了文档,并且我已经能够使用一些基本的正则表达式创建部分列表:
import tika
import re
from tika import parser
parsed = parser.from_file('test.pdf')
content = (parsed["content"])
headers = re.findall("[0-9]*[.][0-9]",content)
现在我需要做这样的事情:
splitsections = content.split() by headers
var_string = ', '.join('?' * len(splitsections))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, splitsections)
抱歉,如果所有这些都不清楚。对此还是很陌生。
如能提供任何帮助,我们将不胜感激。
除了最后一部分与 DB 之外的所有内容都经过了测试。代码也可以改进,但这是另一项任务。主线任务完成。
在列表 split_content
中有您想要的所有信息(即 2.1 和 2.2 之间的文本,然后是 2.2 和 2.3,依此类推,不包括 num+sections 本身的名称(即不包括 2.1 Continuation
、2.2 Name
等等)。
我用 PyPDF2
替换了 tika
,因为 tika
不提供此任务所需的工具(即我没有找到如何提供我需要的页数并获取它的内容)。
def get_pdf_content(pdf_path,
start_page_table_contents, end_page_table_contents,
first_parsing_page, last_phrase_to_stop):
"""
:param pdf_path: Full path to the PDF file
:param start_page_table_contents: The page where the "Contents table" starts
:param end_page_table_contents: The page where the "Contents Table" ends
(i.e. the number of the page where Contents Table ENDs, i.e. not the next one)
:param first_parsing_page: The 1st page where we need to start data grabbing
:param last_phrase_to_stop: The phrase that tells the code where to stop grabbing.
The phrase must match exactly what is written in PDF.
This phrase will be excluded from the grabbed data.
:return:
"""
# ======== GRAB TABLE OF CONTENTS ========
start_page = start_page_table_contents
end_page = end_page_table_contents
table_of_contents_page_nums = range(start_page-1, end_page)
sections_of_articles = [] # ['2.1 Continuation', '2.2 Name', ... ]
open_file = open(pdf_path, "rb")
pdf = PyPDF2.PdfFileReader(open_file)
for page_num in table_of_contents_page_nums:
page_content = pdf.getPage(page_num).extractText()
page_sections = re.findall("[\d]+[.][\d][™\s\w;,-]+", page_content)
for section in page_sections:
cleared_section = section.replace('\n', '').strip()
sections_of_articles.append(cleared_section)
# ======== GRAB ALL NECESSARY CONTENT (MERGE ALL PAGES) ========
total_num_pages = pdf.getNumPages()
parsing_pages = range(first_parsing_page-1, total_num_pages)
full_parsing_content = '' # Merged pages
for parsing_page in parsing_pages:
page_content = pdf.getPage(parsing_page).extractText()
cleared_page = page_content.replace('\n', '')
# Remove page num from the start of "page_content"
# Covers the case with the page 65, 71 and others when the "page_content" starts
# with, for example, "616.6 Liability to Partners. (a) It is understood that"
# i.e. "61" is the page num and "6.6 Liability ..." is the section data
already_cleared = False
first_50_chars = cleared_page[:51]
for section in sections_of_articles:
if section in first_50_chars:
indx = cleared_page.index(section)
cleared_page = cleared_page[indx:]
already_cleared = True
break
# Covers all other cases
if not already_cleared:
page_num_to_remove = re.match(r'^\d+', cleared_page)
if page_num_to_remove:
cleared_page = cleared_page[len(str(page_num_to_remove.group(0))):]
full_parsing_content += cleared_page
# ======== BREAK ALL CONTENT INTO PIECES ACCORDING TO TABLE CONTENTS ========
split_content = []
num_sections = len(sections_of_articles)
for num_section in range(num_sections):
start = sections_of_articles[num_section]
# Get the last piece, i.e. "11.16 FATCA" (as there is no any "end" section after "11.16 FATCA", so we cant use
# the logic like "grab info between sections 11.1 and 11.2, 11.2 and 11.3 and so on")
if num_section == num_sections-1:
end = last_phrase_to_stop
else:
end = sections_of_articles[num_section + 1]
content = re.search('%s(.*)%s' % (start, end), full_parsing_content).group(1)
cleared_piece = content.replace('™', "'").strip()
if cleared_piece[0:3] == '. ':
cleared_piece = cleared_piece[3:]
# There are few appearances of "[Signature Page Follows]", as a "last_phrase_to_stop".
# We need the text between "11.16 FATCA" and the 1st appearance of "[Signature Page Follows]"
try:
indx = cleared_piece.index(end)
cleared_piece = cleared_piece[:indx]
except ValueError:
pass
split_content.append(cleared_piece)
# ======== INSERT TO DB ========
# Did not test this section
for piece in split_content:
var_string = ', '.join('?' * len(piece))
query_string = 'INSERT INTO table VALUES (%s);' % var_string
cursor.execute(query_string, parts)
使用方法:(其中一种可能的方式):
1) 将上面的代码保存在my_pdf_code.py
2) 在 python shell:
import path.to.my_pdf_code as the_code
the_code.get_pdf_content('/home/username/Apollo_Investment_Fund_VIII_LPA_S1.pdf', 2, 4, 24, '[Signature Page Follows]')