Extract paragraph text in Python
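How can I use Python to search a Word document and extract the paragraph text that follows a matching paragraph heading, e.g. "1.2 Summary of Broadspectrum Offer"?
For example, given the document excerpt below, I basically want to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown"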
1. Executive Summary
1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..
1.2 Summary of Broadspectrum Offer
A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown
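Note that the heading numbering changes from document to document, so I don't want to rely on it; I would rather match on the text of the heading.
So far I can search the document, but that is only a start.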
filename1 = "North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx"
from docx import Document
document = Document(filename1)
# Print every paragraph whose text contains "Summary"
for paragraph in document.paragraphs:
    if 'Summary' in paragraph.text:
        print(paragraph.text)
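Here is a preliminary solution (pending a reply to my comment on your post above). It does not yet exclude any additional paragraphs that come after the Summary of Broadspectrum Offer section. If that is needed, you will most likely want a small regex match to detect whether you have reached another header section starting with 1.3 (etc.), and stop collecting paragraphs if so. Let me know if this is a requirement.
Edit: converted the print() call from the list comprehension approach to a standard for loop, in response to Anton vBR's comment below.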
from docx import Document
document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx")
# Find the index of every paragraph containing the `Summary of Broadspectrum Offer` heading text
ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]
# Print the text of every paragraph that comes after the first match
if ind:
    for i, para in enumerate(document.paragraphs):
        if i > ind[0]:
            print(para.text)
# Original list-comprehension form, replaced by the loop above:
# [print(para.text) for i, para in enumerate(document.paragraphs) if ind and i > ind[0]]
>> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown
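The regex idea mentioned above (stop once the next numbered header such as 1.3 appears) could be sketched roughly as follows. This is only an illustration under some assumptions: the heading numbers are typed into the paragraph text rather than generated by Word's automatic numbering, `HEADING_RE` and `section_text` are hypothetical names, and the "1.3 Next Section" entry in the sample is made up to show the stop condition. The function works on plain strings, so with python-docx you would pass it `[p.text for p in document.paragraphs]`.

```python
import re

# Assumed heading shape: "1.", "1.2", "1.2.3", ... followed by a space and a title.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def section_text(paragraph_texts, title):
    """Collect the paragraphs after the heading containing `title`,
    stopping at the next line that looks like a numbered heading."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Match on the title text only, so the heading number may vary.
            inside = title in text
            continue
        if HEADING_RE.match(text):
            break  # reached the next numbered heading
        if text.strip():
            collected.append(text)
    return collected


# Stand-in for [p.text for p in document.paragraphs]
sample = [
    "1. Executive Summary",
    "1.1 Summary of Services",
    "Energy Savings (Carbon Emissions and Intensity Reduction)",
    "1.2 Summary of Broadspectrum Offer",
    "A summary of our Offer to deliver the Scope of Work as outlined in the "
    "tender documents is provided below. Please refer to the various terms "
    "and conditions of our Offer as detailed herein.",
    "Please also find the cost breakdown",
    "1.3 Next Section",  # hypothetical following heading
]

for line in section_text(sample, "Summary of Broadspectrum Offer"):
    print(line)
```

One caveat with this pattern: a body paragraph that happens to start with a number followed by a space would also be treated as a heading, so the regex may need tightening for a given set of documents.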
Additionally, here is another post that may help with an alternative approach, namely using paragraph metadata to detect the heading type: Extracting headings' text from word doc
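As a rough sketch of that metadata-based approach: python-docx exposes each paragraph's style through `paragraph.style.name`, and the built-in heading styles are named "Heading 1", "Heading 2", and so on. The code below assumes the document really uses those built-in styles for its section titles, which may not hold for every file. `section_by_style` and `fake_paragraph` are hypothetical names; the stand-in objects only mimic the `.text` and `.style.name` attributes so the example runs without a .docx file, and the "Next Section" heading is invented for the demo.

```python
from types import SimpleNamespace


def section_by_style(paragraphs, title):
    """Collect body paragraphs after the heading containing `title`,
    stopping at the next paragraph that uses a heading style."""
    collected = []
    inside = False
    for para in paragraphs:
        is_heading = para.style.name.startswith("Heading")
        if not inside:
            inside = is_heading and title in para.text
            continue
        if is_heading:
            break  # the next heading ends the section
        if para.text.strip():
            collected.append(para.text)
    return collected


def fake_paragraph(style_name, text):
    # Stand-in with the same attributes the function reads from a real paragraph.
    return SimpleNamespace(style=SimpleNamespace(name=style_name), text=text)


demo = [
    fake_paragraph("Heading 1", "Executive Summary"),
    fake_paragraph("Heading 2", "Summary of Services"),
    fake_paragraph("Normal", "Energy Savings (Carbon Emissions and Intensity Reduction)"),
    fake_paragraph("Heading 2", "Summary of Broadspectrum Offer"),
    fake_paragraph("Normal", "Please also find the cost breakdown"),
    fake_paragraph("Heading 2", "Next Section"),  # hypothetical following heading
]

print(section_by_style(demo, "Summary of Broadspectrum Offer"))
```

With a real file, the call would be `section_by_style(document.paragraphs, "Summary of Broadspectrum Offer")`. Because it keys off the style rather than the number, it does not depend on how the headings are numbered.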