Python3 Docx: get text between 2 paragraphs

Question

I have .docx files in my directory, and I want to get all the text between two paragraphs.

Example:

Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :

I want to get:

The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life.

I wrote this code:

import docx
import pathlib
import glob
import re

def rf(f1):
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)


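# Assumption: docxfiles was not defined in the original post; here it is
# taken to be every .docx file in the current directory.
docxfiles = glob.glob('*.docx')

for f in docxfiles: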
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        
        print(testf)
    except IOError:
        print('Error opening',f)

It returns None.

What am I doing wrong?

Answer 1

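If you loop over all the paragraphs and print each paragraph's text, you get the document text as it is, but a single p.text inside the loop does not contain the full document text.

The regex does work on a string: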

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""
      
import re
      
fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
      
print(fread)  # None  - because dots do not match \n
     
fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
      
print(fread)
print(fread[1])

Output:

<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>


The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

If you use

for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")

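you will see why your regex does not match when it is applied to one paragraph at a time. It will work when applied to the whole document text.

See "How to extract text from an existing docx file using python-docx" for how to get the whole document text.

Alternatively, look for the paragraph that matches r'Foo\s*:', then put every paragraph.text into a list until you reach a paragraph that matches r'\s*Bar'.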
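The paragraph-by-paragraph approach could look roughly like the sketch below. The function name text_between and its parameters are made up for illustration, and it works on plain strings so it can be tried without a .docx file. It also assumes blank paragraphs should be skipped, which the question does not state.

```python
import re

def text_between(paragraph_texts, start_pattern=r'Foo\s*:', end_pattern=r'\s*Bar'):
    """Return the paragraph texts found after the start marker and before the end marker."""
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Still looking for the paragraph that opens the section.
            if re.match(start_pattern, text):
                inside = True
            continue
        if re.match(end_pattern, text):
            # End marker reached: stop collecting.
            break
        if text.strip():
            # Assumption: empty paragraphs are not wanted in the result.
            collected.append(text)
    return collected

sample = [
    "Foo :",
    "",
    "The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.",
    "",
    "Bar :",
]
print(text_between(sample))
```

With python-docx, the same function should accept the paragraph texts of a real file, for example text_between(p.text for p in docx.Document(f).paragraphs). If the end marker never appears, this sketch returns everything after the start marker, which may or may not be what you want.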
