Extract paragraph text in Python

Question

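How can I use Python to search a Word document and, after finding and matching a paragraph heading such as "1.2 Summary of Broadspectrum Offer", extract the paragraph text that follows it?

That is, given the document sample below, I basically want to get the following text: "A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown"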

1.  Executive Summary

1.1 Summary of Services
Energy Savings (Carbon Emissions and Intensity Reduction)
Upgrade Economy Cycle on Level 2,5,6,7 & 8, replace Chilled Water Valves on Level 6 & 8 and install lighting controls on L5 & 6..

1.2 Summary of Broadspectrum Offer

A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein.
Please also find the cost breakdown

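Note that the heading numbers change from document to document, so I don't want to rely on them; I would rather match on the text of the heading.

So far I can search the document, but that is only a start.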

filename1 = "North Sydney TE SP30062590-1 HVAC - Project Offer -  Rev1.docx"

from docx import Document

document = Document(filename1)
for paragraph in document.paragraphs:
    if 'Summary' in paragraph.text:
        print(paragraph.text)

Answer 1

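This is a preliminary solution (pending a reply to my comment on your post above). It does not yet exclude the paragraphs that come after the Summary of Broadspectrum Offer section. If that is needed, you will most likely want a small regex match to determine whether you have hit another header section such as 1.3 (etc.) and, if so, stop there. Let me know if this is a requirement.

Edit: converted the print() from a list comprehension to a standard for loop, in response to Anton vBR's comment below.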

from docx import Document

document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx")

# Find the index of the `Summary of Broadspectrum Offer` syntax and store it
ind = [i for i, para in enumerate(document.paragraphs) if 'Summary of Broadspectrum Offer' in para.text]

# Print the text for any element with an index greater than the index found in the list comprehension above
if ind:
    for i, para in enumerate(document.paragraphs):
        if i > ind[0]:
            print(para.text)

[print(para.text) for i, para in enumerate(document.paragraphs) if ind and i > ind[0]]

>> A summary of our Offer to deliver the Scope of Work as outlined in the tender documents is provided below. Please refer to the various terms and conditions of our Offer as detailed herein. Please also find the cost breakdown
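The regex check mentioned in this answer, which stops at the next numbered header, could look roughly like the sketch below. The helper name `extract_section` and the pattern `HEADING_RE` are illustrative assumptions, not part of python-docx. The pattern assumes the heading numbers ("1.3", "2.") are typed into the paragraph text; if they come from Word's automatic numbering they may not appear in `paragraph.text`, and a body paragraph that happens to start with a number would also end the section early.

```python
import re

# Assumption: a numbered heading starts with digits separated by dots,
# e.g. "1.3 Pricing" or "2. Scope". Adjust the pattern to your documents.
HEADING_RE = re.compile(r"^\s*\d+(\.\d+)*\.?\s+\S")


def extract_section(texts, heading_text):
    """Return the non-empty paragraph texts that follow the first paragraph
    containing `heading_text`, up to the next numbered heading."""
    collected = []
    in_section = False
    for text in texts:
        if not in_section:
            # Still searching for the heading we care about.
            if heading_text in text:
                in_section = True
            continue
        if HEADING_RE.match(text):
            # Reached the next numbered heading: the section is over.
            break
        if text.strip():
            collected.append(text)
    return collected


# Usage with python-docx (requires the package and a real file):
# from docx import Document
# document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx")
# texts = [para.text for para in document.paragraphs]
# print("\n".join(extract_section(texts, "Summary of Broadspectrum Offer")))
```

Working on a plain list of strings keeps the matching logic independent of the file, so it can be tried on sample text before pointing it at a real document.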

Also, here is another post that may help with a different approach, namely using paragraph metadata to detect the heading type: Extracting headings' text from word doc
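A minimal sketch of that metadata-based approach follows, assuming the document's headings use Word's built-in "Heading 1", "Heading 2", ... styles (custom or localized style names would need a different check). The helper names `is_heading` and `text_under_heading` are hypothetical; the only python-docx attributes relied on are `paragraph.text` and `paragraph.style.name`.

```python
def is_heading(paragraph):
    """True if the paragraph's style name starts with "Heading".

    Assumption: headings use the built-in "Heading N" styles.
    """
    style = getattr(paragraph, "style", None)
    name = getattr(style, "name", "") or ""
    return name.startswith("Heading")


def text_under_heading(paragraphs, heading_text):
    """Collect the non-empty body paragraphs under the first heading whose
    text contains `heading_text`, stopping at the next heading of any level."""
    collected = []
    in_section = False
    for para in paragraphs:
        if is_heading(para):
            if in_section:
                # A new heading ends the section we were collecting.
                break
            in_section = heading_text in para.text
            continue
        if in_section and para.text.strip():
            collected.append(para.text)
    return collected


# Usage with python-docx (requires the package and a real file):
# from docx import Document
# document = Document("North Sydney TE SP30062590-1 HVAC - Project Offer - Rev1.docx")
# print("\n".join(text_under_heading(document.paragraphs, "Summary of Broadspectrum Offer")))
```

Because this version never looks at the heading numbers, it should keep working when the numbering changes from one document to the next, provided the headings are styled consistently.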
