Python3 Docx get text between 2 paragraphs
I have .docx files in my directory, and I want to get all the text between two paragraphs.
Example:
Foo :
The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life.
Bar :
I want to get:
The foo is not easy, but we have to do it.
We are looking for new things in our ad libitum way of life.
I wrote this code:
import docx
import pathlib
import glob
import re

def rf(f1):
    # Read every paragraph of the document and join them into one string
    reader = docx.Document(f1)
    alltext = []
    for p in reader.paragraphs:
        alltext.append(p.text)
    return '\n'.join(alltext)

# docxfiles: list of .docx paths (its definition is not shown in the question)
for f in docxfiles:
    try:
        fulltext = rf(f)
        testf = re.findall(r'Foo\s*:(.*)\s*Bar', fulltext, re.DOTALL)
        print(testf)
    except IOError:
        print('Error opening', f)
It returns None.
What am I doing wrong?
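If you loop over all paragraphs and print each paragraph's text, you get the document text as it is, but a single p.text inside the loop does not contain the full document text.
It works on a string: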
t = """Foo :
The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
Bar :"""
import re
fread = re.search(r'Foo\s*:(.*)\s*Bar', t)
print(fread) # None - because dots do not match \n
fread = re.search(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)
print(fread)
print(fread[1])
Output:
<_sre.SRE_Match object; span=(0, 115), match='Foo :\n\nThe foo is not easy, but we have to do i>
The foo is not easy, but we have to do it. We are looking for new things in our ad libitum way of life.
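One thing to watch for if a document contains more than one Foo/Bar block: `(.*)` is greedy, so with re.DOTALL it runs to the last "Bar" in the text. A minimal sketch of the difference, using a made-up two-block string (the sample text and the variable names `greedy` and `lazy` are illustrative, not from the question):

```python
import re

# Hypothetical sample with two Foo/Bar blocks
t = "Foo :\n\nThe foo is not easy.\n\nBar :\n\nFoo :\n\nSecond block.\n\nBar :"

# Greedy: one match that swallows everything up to the LAST "Bar"
greedy = re.findall(r'Foo\s*:(.*)\s*Bar', t, re.DOTALL)

# Non-greedy: stops at the first "Bar" after each "Foo :";
# strip() drops the newlines around each captured block
lazy = [m.strip() for m in re.findall(r'Foo\s*:(.*?)\s*Bar', t, re.DOTALL)]

print(len(greedy))
print(lazy)
```

With only one block per document, both patterns capture the same text apart from surrounding whitespace.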
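If you use
for p in reader.paragraphs:
    print("********")
    print(p.text)
    print("********")
you will see why your regex does not match. Your regex will work on the whole document text.
See How to extract text from an existing docx file using python-docx for how to get the whole document text.
You could also look for a paragraph that matches r'Foo\s*:' and then put every following paragraph.text into a list until you reach a paragraph that matches r'\s*Bar'.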
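The paragraph-scanning idea could be sketched roughly as follows. The helper name `text_between` and its parameters are hypothetical; it takes a plain list of strings so it can be tried without a .docx file, and it assumes the markers sit at the start of their own paragraphs.

```python
import re

def text_between(paragraph_texts, start_pat=r'Foo\s*:', end_pat=r'\s*Bar'):
    """Return the paragraph texts found between the start and end markers.

    Returns an empty list if either marker is missing (a design choice
    made for this sketch).
    """
    collected = []
    inside = False
    for text in paragraph_texts:
        if not inside:
            # Still looking for the paragraph that opens the section
            if re.match(start_pat, text):
                inside = True
            continue
        if re.match(end_pat, text):
            # Closing marker reached: done
            return collected
        collected.append(text)
    return []

sample = ["Intro", "Foo :", "The foo is not easy, but we have to do it.",
          "We are looking for new things in our ad libitum way of life.", "Bar :"]
print(text_between(sample))
```

With python-docx, the input list would be something like `[p.text for p in docx.Document(path).paragraphs]`, where `path` is one of your .docx files. Note that re.match anchors at the start of the paragraph, so any paragraph beginning with "Bar" ends the section.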