Search all docx files with python-docx in a directory (batch)
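I have a bunch of Word docx files that all contain the same embedded Excel table. I am trying to extract the same cells from multiple files.
I figured out how to hard-code it for a single file: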
from docx import Document
document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx[=11=]6-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
print Project
But how do I batch this? I tried some variations on listdir, but they didn't work for me, and I'm too much of a beginner to get there on my own.
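Assuming the code above gets you the data you need, all that is left is to read the files from disk and process them.
First, let's define a function that does what you are already doing; then we'll loop over all the documents in the directory and process each one with that function.
Edit the following untested code to suit your needs.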
# We'll use os.walk to iterate over all the files in the directory.
# We're going to make some simplifying assumptions:
# 1) all the docx files are in the same directory
# 2) you want to store the paragraph text in a list
import os
from docx import Document

DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'

def get_para(document):
    table = document.tables[0]
    Project_cell = table.rows[2].cells[2]
    paragraph = Project_cell.paragraphs[0]
    Project = paragraph.text
    return Project

if __name__ == "__main__":
    paragraphs = []
    # The first item from os.walk is (dirpath, dirnames, filenames) for the
    # top-level directory; only the file names are needed here.
    _, _, filenames = next(os.walk(DOCS))
    for filename in filenames:
        file_name = os.path.join(DOCS, filename)
        document = Document(file_name)
        para = get_para(document)
        paragraphs.append(para)
    print(paragraphs)
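How you iterate over the files really depends on your project deliverables. Are all the files in a single folder? Are there files other than .docx files?
To cover all cases, we'll assume there are subdirectories and that other files are mixed in with your .docx files. For this, we'll use os.walk() and os.path.splitext().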
import os
from docx import Document

# First, we'll create an empty list to hold the path to all of your docx files
document_list = []

# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all its subfolders) using os.walk(). You could alternatively use os.listdir()
# to get a list of files. It would be recommended, and simpler, if all files are
# in the same folder. Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        # it to our list
        if os.path.splitext(os.path.join(path, name))[1] == ".docx":
            document_list.append(os.path.join(path, name))

# Now create a loop that goes over each file path in document_list, replacing your
# hard-coded path with the variable.
for document_path in document_list:
    document = Document(document_path)  # Change the document being loaded each loop
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    project = paragraph.text
    print(project)
For further reading, see the documentation on os.listdir().
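If all of the files really do sit in one folder, the os.listdir() variant could look roughly like the sketch below. The function name list_docx_files is made up for illustration. The case-insensitive extension check and the skipping of names that start with "~$" (the lock files Word creates while a document is open) are assumptions about what you want, so adjust them as needed.

```python
import os

def list_docx_files(folder):
    """Return the sorted full paths of the .docx files directly inside folder."""
    paths = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        extension = os.path.splitext(name)[1].lower()
        # Skip subfolders, other file types, and Word's "~$" lock files.
        if (os.path.isfile(full_path)
                and extension == ".docx"
                and not name.startswith("~$")):
            paths.append(full_path)
    return sorted(paths)
```

Unlike os.walk(), this does not descend into subfolders, which is why it is the simpler choice for a flat directory.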
Also, it would be best to put your code into a reusable function, but that is another challenge for you!
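If you want a starting point for that challenge, here is one possible shape, offered as a sketch rather than the definitive way to do it. The names cell_text and project_names are hypothetical, the default indices (table 0, row 2, column 2) are taken from the code in the question, and treating an IndexError as "this file has no such cell" is an assumption about how you want to handle odd files.

```python
def cell_text(document, table_index=0, row=2, column=2):
    """Return the text of the first paragraph in one table cell."""
    table = document.tables[table_index]
    cell = table.rows[row].cells[column]
    return cell.paragraphs[0].text

def project_names(paths):
    """Map each .docx path to its cell text, or None if the cell is missing."""
    # Imported here so that cell_text() stays usable on its own.
    from docx import Document

    results = {}
    for path in paths:
        try:
            results[path] = cell_text(Document(path))
        except IndexError:
            # Assumption: a missing table, row, or column is recorded as
            # None instead of stopping the whole batch.
            results[path] = None
    return results
```

With the document_list built above, calling project_names(document_list) would give you a dictionary you can print or write out however you like.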