Python3 Docx get text between 2 paragraphs

I have .docx files in a directory, and I want to get all the text between two paragraphs.

Example:

Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :

I want to get:

The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life. 

I wrote this code:

import docx
import pathlib
import glob
import re

def rf(f1):
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)


for f in docxfiles:
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        
        print(testf)
    except IOError:
        print('Error opening',f)

It returns None.

What am I doing wrong?

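If you iterate over all paragraphs and print each paragraph's text, you get the document text as it is - but a single p.text inside the loop does not contain the complete document text.

It works on a string: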

t = """Foo :

The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

Bar :"""
      
import re
      
fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
      
print(fread)  # None  - because dots do not match \n
     
fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
      
print(fread)
print(fread[1])

Output:

<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>


The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.

If you use

for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")

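you will see why your regex does not match. Your regex only works on the whole document text, not on a single paragraph.

See "How to extract text from an existing docx file using python-docx" for how to get the whole document text.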

You could also look for a paragraph that matches r'Foo\s*:' - then put every following paragraph.text into a list until you hit a paragraph that matches r'\s*Bar'.
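
The paragraph-scanning idea could be sketched roughly like this. The function name text_between and its parameters are made up for illustration, and it takes plain strings rather than python-docx objects so it can be tried without a .docx file. It also skips empty paragraphs, which is an assumption about what you want.

```python
import re


def text_between(paragraph_texts, start_pattern=r'Foo\s*:', end_pattern=r'\s*Bar'):
    """Collect paragraph texts after the start marker and before the end marker.

    Hypothetical helper: paragraph_texts is any iterable of strings,
    for example the p.text values of a document's paragraphs.
    """
    start_re = re.compile(start_pattern)
    end_re = re.compile(end_pattern)
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Still looking for the paragraph that opens the section.
            if start_re.match(text):
                inside = True
            continue
        if end_re.match(text):
            # Reached the closing paragraph, stop collecting.
            break
        if text.strip():
            # Assumption: empty paragraphs are not wanted in the result.
            collected.append(text)
    return collected


sample = [
    "Intro",
    "Foo :",
    "",
    "The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.",
    "",
    "Bar :",
    "Outro",
]
result = text_between(sample)
print(result)
```

With python-docx you would probably pass something like (p.text for p in docx.Document(f).paragraphs) as the first argument. Note that re.match only anchors at the start of the paragraph, so r'\s*Bar' would also stop at a paragraph starting with "Barely"; tighten the pattern if that matters for your files.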