Regex only working on one line of a docx file

Question

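I am trying to read a docx file and extract the data between certain words into a list. I want to find every instance where the data matches, which is what I am doing with a regex. I only get an output when the data is on a single line, and I think this has something to do with the str type being printed after every space (I don't know why that happens). Example below: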

Code below:

import re
from docx import Document

document = Document('myfile.docx')
lst=[]
for para in document.paragraphs:
    orig = para.text
    orig= str(orig)
    print(type(orig))
    output= re.findall(r'sent1([^(]*)sent2',orig)
    print(re.findall(r'sent1([^(]*)sent2',orig))
    lst.append(output)

How my file looks on screen:

Heading


Some data here. sent1 this is my data xyz, hello sent2.


Heading 2

Another paragraph here with spaced below.

Output for my file when the type is printed. Each piece is a string, and I don't know why it prints like this:

<class 'str'>
My data here
<class 'str'>
sent 1 and more data this space
<class 'str'>
sent2 here
sent1 example2 sent2

Desired output (a list of all the characters captured between sent1 and sent2 throughout the document):

output=['and more data this space', 'example2']

Current output:

output=['example2']

Answer 1

Well, I would just merge everything into one big string and run the regex against that. So something like this:

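import re
from docx import Document

document = Document('myfile.docx')

# Collect the text of every paragraph, then join it into a single string
fulltext = []
for para in document.paragraphs:
    fulltext.append(para.text)
fulltext = ' '.join(fulltext)

# word1 / word2 stand for your delimiters (sent1 / sent2 in the question).
# The non-greedy group keeps each word1 ... word2 pair as a separate match.
output = re.findall(r'word1 (.*?) word2', fulltext, re.DOTALL)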
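To see why joining helps, here is a minimal sketch that runs without a docx file. The `paragraphs` list is made-up sample data standing in for `[para.text for para in document.paragraphs]`, and the pattern with `\s*` around a lazy group is just one way to write it, not necessarily what your real delimiters need. It assumes the text between sent1 and sent2 is split across two paragraphs, as your printed output suggests.

```python
import re

# Hypothetical paragraph texts, standing in for
# [para.text for para in document.paragraphs]
paragraphs = [
    "My data here",
    "sent1 and more data this space",
    "sent2 here",
    "sent1 example2 sent2",
]

# Lazy capture between the two markers; DOTALL lets "." cross any
# newline characters that may appear inside the text
pattern = re.compile(r"sent1\s*(.*?)\s*sent2", re.DOTALL)

# Searching paragraph by paragraph misses the pair that is split
# across two paragraphs
per_paragraph = [m for text in paragraphs for m in pattern.findall(text)]

# Searching the joined text finds both pairs
joined = pattern.findall(" ".join(paragraphs))

print(per_paragraph)  # ['example2']
print(joined)         # ['and more data this space', 'example2']
```

The per-paragraph loop can never match a pair whose sent1 and sent2 sit in different paragraphs, because each call to `findall` only sees one paragraph at a time.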

python

regex

docx

python-3.x