DOCX file to text file conversion using Python
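I wrote the following code to convert my docx files to text files. The output written to the text file is only the last paragraph/part of the document, not the complete content. The code is as follows: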
from docx import Document
import io
import os
import shutil

def convertDocxToText(path):
    for d in os.listdir(path):
        fileExtension=d.split(".")[-1]
        if fileExtension =="docx":
            docxFilename = path + d
            print(docxFilename)
            document = Document(docxFilename)

            # for printing the complete document
            print('\nThe whole content of the document:->>>\n')
            for para in document.paragraphs:
                textFilename = path + d.split(".")[0] + ".txt"
                with io.open(textFilename,"w", encoding="utf-8") as textFile:
                    #textFile.write(unicode(para.text))
                    x=unicode(para.text)
                    print(x) # the complete content gets printed by this line
                    textFile.write((x)) # after writing the content to text file only last paragraph is copied.
                    #textFile.write(para.text)

path= "/home/python/resumes/"
convertDocxToText(path)
Problem

As your code shows in the last for loop:
for para in document.paragraphs:
    textFilename = path + d.split(".")[0] + ".txt"
    with io.open(textFilename,"w", encoding="utf-8") as textFile:
        x=unicode(para.text)
        textFile.write((x))
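For every paragraph in the document, you open a file named textFilename. Suppose you have a file named MyFile.docx in /home/python/resumes/; then textFilename holds the value /home/python/resumes/MyFile.txt on every iteration of the for loop. The problem is that you open that same file in w (write) mode each time, which truncates it and overwrites everything written so far, so only the last paragraph survives.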
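The overwrite can be reproduced in isolation, without python-docx. The sketch below is only an illustration: the paragraph strings and the file name demo.txt are made up, and the file is created in a temporary directory.

```python
import os
import tempfile

# Hypothetical stand-ins for the para.text values of a document.
paragraphs = ["first paragraph", "second paragraph", "last paragraph"]

with tempfile.TemporaryDirectory() as tmp_dir:
    text_filename = os.path.join(tmp_dir, "demo.txt")

    # Same shape as the loop in the question: the file is reopened in
    # "w" mode for every paragraph, which truncates it each time.
    for text in paragraphs:
        with open(text_filename, "w", encoding="utf-8") as text_file:
            text_file.write(text)

    # Read back what is left once the loop has finished.
    with open(text_filename, encoding="utf-8") as text_file:
        content_reopened_each_time = text_file.read()

print(repr(content_reopened_each_time))
```

Only the text from the final iteration is left in the file, which matches the symptom described in the question.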
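Solution:

You have to open the file once, outside the for loop over the paragraphs, and then write the paragraphs to it one by one.

The following code fixes the problem: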
from docx import Document
import io
import shutil
import os

def convertDocxToText(path):
    for d in os.listdir(path):
        fileExtension=d.split(".")[-1]
        if fileExtension =="docx":
            docxFilename = path + d
            print(docxFilename)
            document = Document(docxFilename)
            textFilename = path + d.split(".")[0] + ".txt"
            # Open the output file once per document, before looping over the paragraphs.
            with io.open(textFilename,"w", encoding="utf-8") as textFile:
                for para in document.paragraphs:
                    # para.text is already unicode text, so unicode() is not needed
                    # (and does not exist in Python 3); the newline keeps paragraphs apart.
                    textFile.write(para.text + u"\n")

path= "/home/python/resumes/"
convertDocxToText(path)
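For Python 3, the same fix could be sketched with pathlib, which avoids relying on a trailing slash in path and on d.split(".")[0] (that cuts a name like my.resume.docx down to my). The function names write_paragraphs and convert_docx_dir are made up for this sketch and are not part of python-docx; only Document and paragraphs come from the library.

```python
from pathlib import Path


def write_paragraphs(paragraph_texts, text_path):
    """Write each paragraph on its own line, opening the file only once."""
    with open(text_path, "w", encoding="utf-8") as text_file:
        for text in paragraph_texts:
            text_file.write(text + "\n")


def convert_docx_dir(directory):
    """Convert every .docx file in `directory` to a .txt file next to it."""
    # Imported here so that write_paragraphs can be used on its own
    # even when python-docx is not installed.
    from docx import Document

    for docx_path in sorted(Path(directory).glob("*.docx")):
        document = Document(str(docx_path))
        # with_suffix replaces only the final extension: a.b.docx -> a.b.txt
        text_path = docx_path.with_suffix(".txt")
        write_paragraphs((para.text for para in document.paragraphs), text_path)
```

Calling convert_docx_dir("/home/python/resumes/") would then write one .txt file per .docx file in that directory. As in the answer above, only document.paragraphs is read, so text inside tables is not included.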