Reading .doc file in Python using antiword in Windows (also .docx)

Question

I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

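It does read the file, but the result contains a lot of garbage, and I can't strip that garbage out because I don't know where it starts and where it ends.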
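The garbage appears because a .doc file is a binary container, not plain text, so opening it in text mode mixes the document text with raw structural bytes. As a minimal sketch, the file type can be guessed from its leading bytes before choosing a reader. The helper name `sniff_word_format` is hypothetical, and the check is deliberately coarse: any ZIP archive is reported as .docx.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # signature of legacy OLE2 files such as .doc
ZIP_MAGIC = b"PK\x03\x04"                        # .docx files are ZIP archives

def sniff_word_format(path):
    """Guess 'doc', 'docx' or 'unknown' from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "doc"
    if head.startswith(ZIP_MAGIC):
        return "docx"  # assumption: any ZIP archive is treated as .docx here
    return "unknown"
```

This only tells the two container formats apart; extracting the text still needs a tool such as antiword or docx2txt.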

I also tried installing the textract module, which says it can read any file format, but I ran into a lot of dependency issues while installing it on Windows.

So I did this with the antiword command-line utility instead; my answer is below.

Answer 1

You can use the antiword command-line utility to do this. I know most of you will have tried it already, but I still wanted to share.

Download antiword from here.

Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
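If changing the system-wide PATH is not convenient, the folder can also be added for the running Python process only, which child processes started afterwards inherit. This is a minimal sketch; the helper name `add_to_path` is hypothetical, and the folder is assumed to be the one from the step above.

```python
import os

def add_to_path(directory):
    """Append directory to PATH for this process only, if it is not already there."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in entries:
        entries.append(directory)
        # drop empty entries so PATH does not gain stray separators
        os.environ["PATH"] = os.pathsep.join(e for e in entries if e)
    return os.environ["PATH"]

# Windows example, using the folder from the step above:
# add_to_path(r"C:\antiword")
```

The change disappears when the script exits, so it has to run before every call that launches antiword.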

Here is an example of how to use it, handling both docx and doc files:

import os
import docx2txt

def get_doc_text(filepath, file):
    # build the full path so the function also works without a trailing separator
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        return docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; it is redirected to a temporary file
        # that reuses the .doc name with an extra 'x' (it is not a real .docx)
        docx_file = full_path + 'x'
        if not os.path.exists(docx_file):
            # quote both paths so names containing spaces still work
            os.system('antiword "' + full_path + '" > "' + docx_file + '"')
            with open(docx_file, errors='ignore') as f:
                text = f.read()
            os.remove(docx_file)  # the temporary file was only needed for reading
        else:
            # a real .docx with the same name already exists; do not overwrite it
            print('Info: a .docx file with the same name as the .doc exists, so the .doc was not read')
            text = ''
        return text
    return ''  # neither .doc nor .docx
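The function above goes through os.system and a temporary file. As an alternative sketch, not the answer author's method, the output of antiword can be captured directly with subprocess. The name `antiword_text` and the `antiword_cmd` parameter are hypothetical, and decoding the output as UTF-8 is an assumption about how antiword is configured.

```python
import subprocess

def antiword_text(doc_path, antiword_cmd="antiword"):
    """Run antiword on doc_path and return whatever it prints as a string."""
    result = subprocess.run(
        [antiword_cmd, doc_path],   # list form: no shell, no quoting problems
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,                 # raise CalledProcessError on a non-zero exit
    )
    # Assumption: the output decodes as UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Because nothing is written to disk, there is no name clash with an existing .docx file and nothing to clean up afterwards.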

Now call this function:

filepath = "D:\\input\\"  # backslashes must be escaped inside a normal string
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)

This could be a good alternative for reading .doc files in Python on Windows.

Hope it helps, thanks.

Answer 2

Mithilesh's example is good, but it is simpler to use textract directly once antiword is installed. Download antiword and extract the antiword folder to C:\. Then add the antiword folder to your PATH environment variable (instructions for adding to PATH here). Open a new terminal or command console to reload your PATH environment variable. Install textract with pip install textract.

Then you can use textract (which uses antiword for .doc files) like this:

import textract
raw = textract.process('filename.doc')  # returns a bytestring
text = raw.decode('utf-8')  # converts from bytestring to string

If you get an error, try running the command antiword from a terminal/console to make sure it works. Also make sure the file path to the .doc file is correct (e.g. by using os.path.exists('filename.doc')).
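Both checks can be run from Python before calling textract. This is a minimal sketch; the helper name `preflight_problems` is hypothetical, and it only tests whether the tool can be found on PATH, not whether it runs correctly.

```python
import os
import shutil

def preflight_problems(doc_path, tool="antiword"):
    """Return a list of readable problems; an empty list means both checks passed."""
    problems = []
    if not os.path.exists(doc_path):
        problems.append("file not found: " + doc_path)
    if shutil.which(tool) is None:
        problems.append(tool + " is not on PATH")
    return problems
```

For example, `preflight_problems('filename.doc')` returning an empty list suggests that a remaining error comes from the document itself or from the textract installation.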
