Reading .doc files in Python on Windows 10

Question

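Note: this was flagged as a possible duplicate of this question, but my problem is that textract does not work. I'm looking for either (a) a way to get textract working on Windows 10 or (b) an alternative solution.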

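I'm building a system that needs to read various types of files. I've set up pdfminer to read .pdf files, and following the process outlined here I installed textract, so I can now also read .docx files. However, textract relies on antiword for reading .doc files and I cannot get this to work: even after following the directions here, I have been unable to find and install a working version of antiword. I don't have Microsoft Word installed on my machine, and I'm running Windows 10 and Python 3.6.5. Is there another way to read .doc files?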

Here is the error I get when running textract.process('d.doc') (ignore the first error, the file definitely exists):

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 84, in run
    stdout=subprocess.PIPE, stderr=subprocess.PIPE,
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\ProgramData\Anaconda3\lib\site-packages\textract\parsers\utils.py", line 91, in run
    ' '.join(args), 127, '', '',
textract.exceptions.ShellError: The command antiword d.doc failed with exit code 127

Answer 1

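I was able to get part of the text using olefile, but olefile ultimately only deals with bytes and does not handle the encoding of Word .doc files. The solution was to use LibreOffice; see my other question.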
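
A minimal sketch of the LibreOffice route, not necessarily what the linked question does: it asks LibreOffice to convert the .doc to .docx in headless mode, so the result can be read with the .docx tooling that already works. The function names and the `converted` output folder are hypothetical. It assumes LibreOffice is installed and that its `soffice` executable is on PATH; on Windows it often is not, in which case pass the full path to `soffice.exe` yourself.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(soffice, doc_path, out_dir):
    """Build the LibreOffice command line that converts a document to .docx."""
    return [
        str(soffice),
        "--headless",              # run without opening a window
        "--convert-to", "docx",    # target format
        "--outdir", str(out_dir),  # where the converted file is written
        str(doc_path),
    ]


def doc_to_docx(doc_path, out_dir="converted", soffice=None):
    """Convert a .doc file to .docx and return the path of the new file.

    `soffice` is the LibreOffice executable; if omitted it is looked up on PATH.
    """
    doc_path = Path(doc_path)
    out_dir = Path(out_dir)
    soffice = soffice or shutil.which("soffice")
    if soffice is None:
        raise FileNotFoundError(
            "soffice was not found on PATH; pass the full path to soffice.exe"
        )
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(soffice, doc_path, out_dir), check=True)
    # LibreOffice keeps the input file name and swaps the extension.
    return out_dir / (doc_path.stem + ".docx")
```

With this in place, something like `textract.process(str(doc_to_docx('d.doc')))` should go through textract's .docx parser instead of antiword.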

Answer 2

From the 'Windows installation problem' thread: https://github.com/deanmalmgren/textract/issues/194#issuecomment-507243521

After following the steps to 'install' antiword, I ran into the same problem as you.

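Restarting Windows after setting the PATH environment variable completely resolved this exact error message. (It was the last error I hit while using textract to process .doc files.)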

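Installation instructions taken from https://github.com/deanmalmgren/textract/issues/194#issuecomment-506065817

"Install Antiword (the steps I followed)

Go to https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Antiword.shtml
Extract to c:\antiword (yes, it has to be in the root)
Add the location to the path, as we did for tesseract-ocr [basically, add c:\antiword to the system PATH (environment variable)]"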
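
To check whether the PATH change has actually taken effect in the Python process you are running, a small sketch like the one below may help. The `[WinError 2]` in the question's traceback is raised when `subprocess` cannot find the `antiword` executable, so looking it up with `shutil.which` first gives a clearer message. The function names are hypothetical, and decoding the output as UTF-8 is an assumption; antiword's actual output encoding depends on how it is configured.

```python
import shutil
import subprocess


def find_executable(name="antiword", path=None):
    """Return the full path of `name`, or None if it is not on the search path.

    `path` defaults to the PATH seen by the current Python process.
    """
    return shutil.which(name, path=path)


def read_doc_with_antiword(doc_path, path=None):
    """Run antiword on a .doc file and return its text output."""
    exe = find_executable("antiword", path=path)
    if exe is None:
        raise RuntimeError(
            "antiword is not on PATH for this process; add its folder to the "
            "system PATH and restart the terminal (or Windows)"
        )
    result = subprocess.run(
        [exe, str(doc_path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Assumption: treat the output as UTF-8 and replace undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

If `find_executable()` returns `None` in a fresh Python session even though `c:\antiword` is in the system settings, the session is most likely still using the old environment, which would be consistent with the restart fixing the error.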


text-extraction

.doc

python-3.x