Reading a .docx file to extract the text along with font and other formatting information of the text

Question

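As the title says, my goal is to find a Python library that extracts both text and font information from a .docx file. For example, given the text "hello world", I need to be able to tell that the string hello is bold and not italic, and that the string world is not bold but is italic. Beyond knowing whether text is bold or italic, I also need other information such as size, color, and typeface (e.g. Arial, Times New Roman). I need to be able to read the whole .docx file and extract this information.

I tried the python-docx library and was able to extract the text, but not the related font information in the .docx file. For example, in the code below: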

import docx  # python-docx

doc = docx.Document('cg0002.docx')
for para in doc.paragraphs:
    for run in para.runs:
        font = run.font
        # None when bold is not set directly on this run
        is_bold = font.bold
        print(repr(run.text), is_bold)

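I get None for font and is_bold. After further research, I came to understand that you cannot use the library to read fonts from a .docx; you have to assign them yourself. Is there another library I can use to achieve my goal?

Compromises I am willing to make: I do not insist on Python for this. I can use any other language, such as Java, JavaScript, C/C++, PowerShell, and so on. I can also convert the document to another format such as PDF if that makes the information easier to extract, provided the document stays intact (for example, I could upload it to Google Docs and use Apps Script to extract the text, but some of the fonts are not retained after viewing it in Google Docs, so I would prefer not to do that).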
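For context, python-docx documents a value of None on properties such as run.font.bold as "not set directly on this run, inherited from the style hierarchy", not as "unreadable". The following is a minimal sketch of resolving the inherited value by walking the run's character style and then the paragraph style, each through its base_style chain. The helper names effective_font_value and dump_runs are hypothetical, and the sketch assumes that document-level defaults and theme fonts (which it does not consult) are not needed.

```python
def _from_style_chain(style, attr):
    """Follow style -> base_style -> ... and return the first explicit value."""
    while style is not None:
        value = getattr(style.font, attr)
        if value is not None:
            return value
        style = style.base_style
    return None


def effective_font_value(run, paragraph, attr):
    """Resolve a font attribute ('bold', 'italic', 'size', 'name') for a run.

    Order checked: the run itself, the run's character style chain,
    then the paragraph's style chain. Returns None if nothing sets it.
    """
    value = getattr(run.font, attr)
    if value is not None:
        return value
    value = _from_style_chain(run.style, attr)
    if value is not None:
        return value
    return _from_style_chain(paragraph.style, attr)


def dump_runs(path):
    """Yield (text, formatting dict) for every run in the document body."""
    import docx  # python-docx; imported here so the helpers above stay dependency-free

    doc = docx.Document(path)
    for para in doc.paragraphs:
        for run in para.runs:
            attrs = ("bold", "italic", "size", "name")
            yield run.text, {a: effective_font_value(run, para, a) for a in attrs}
```

Calling list(dump_runs('cg0002.docx')) would then give one entry per run. A value that is still None after this walk most likely comes from the document defaults, which this sketch leaves unresolved.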

Answer 1

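For DocX it is probably 100% best to use VBA to get at the details.

However, a "potential" alternative route may be to simply strip any style overrides by exporting from WordPad to basic RTF. Then look at the redefined characteristics of the target block.

Note: depending on the conversion, this may not achieve your goal 100% reliably.

Although we can use WordPad from the command line to convert DocX to PDF, we cannot convert DocX to RTF without a VBS macro, but that is another question.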

From the header we can see CodePage=1252 & 2057 = ~~English (United Kingdom)~~ British :-)

Breaking it down by eye: \b\f0\fs24\lang9 Hello \b0\i World\ul\i0 !\ulnone\fs22\par

\b - Is the start of Bold
\f0 - Calibri in the given language (BEWARE here 0 is an index NOT a stop)
\fs24 - Is points x 2 so the text here is 12 point
\lang9 - I forget at the moment, awaiting correction in comments :-)
 Hello - Has both a leading and trailing space (leading is to be ignored)
\b0 - My BAD, boldening STOPS, AFTER the space between the words
\i - Start italics (ignore the space before World)
\ul - Start underlining
\i0 - Stop italics (ignore the space before !)
\ulnone - Stop underline (don't ask me why not \ul0)
\fs22 - I will let you guess the default page font height but by now you know it is not 22

\par - THE END, "That's all Folks!" ™
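The breakdown above can be sketched in Python as a tiny state machine: each control word flips a piece of formatting state, and each stretch of plain text is recorded together with the state in force at that point. This is only a rough illustration for fragments like the one shown, not an RTF parser; it assumes no groups, escapes such as \'e9, or font table lookups, and the names TOKEN and parse_rtf_fragment are made up for the example.

```python
import re

# A control word is a backslash, letters, an optional number, and an optional
# single delimiter space (which is NOT part of the text). Anything else that
# is not a backslash or brace is treated as plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([^\\{}]+)")


def parse_rtf_fragment(fragment):
    """Return a list of (text, formatting) pairs for a simple RTF fragment."""
    state = {"bold": False, "italic": False, "underline": False,
             "size_pt": None, "font_index": None}
    runs = []
    for match in TOKEN.finditer(fragment):
        word, param, text = match.groups()
        if text is not None:
            runs.append((text, dict(state)))   # snapshot of the current state
        elif word == "b":
            state["bold"] = param != "0"       # \b starts, \b0 stops
        elif word == "i":
            state["italic"] = param != "0"     # \i starts, \i0 stops
        elif word == "ul":
            state["underline"] = param != "0"
        elif word == "ulnone":
            state["underline"] = False
        elif word == "fs" and param is not None:
            state["size_pt"] = int(param) / 2  # \fsN is in half-points
        elif word == "f" and param is not None:
            state["font_index"] = int(param)   # index into the font table
        # other control words (\lang, \par, ...) are ignored in this sketch
    return runs


if __name__ == "__main__":
    sample = r"\b\f0\fs24\lang9 Hello \b0\i World\ul\i0 !\ulnone\fs22\par"
    for text, fmt in parse_rtf_fragment(sample):
        print(repr(text), fmt)
```

For the sample line this yields three runs: "Hello " (bold, 12 pt), "World" (italic) and "!" (underlined), which matches the reading by eye.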

P.S.

I revisited the source and made 2 corrections; see if you can spot the two changes. "My" clue to the second one is above, but it can easily trip you up when using regex.

\b\f0\fs22\lang9 Hello,\i \b0 World\ul\i0 !\ulnone\par

although it should eventually be

\b\f0\fs22\lang9 Hello,\b0 \i World\ul\i0 !\ulnone\par
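One way a regex can go wrong on a line like this, assuming the pitfall meant is the delimiter space after a control word: a pattern that matches only the control word leaves that space behind as if it were text. The small comparison below is an illustrative sketch using only Python's re module; the variable names are arbitrary.

```python
import re

fragment = r"\b\f0\fs22\lang9 Hello,\b0 \i World\ul\i0 !\ulnone\par"

# Naive: split on control words only, so delimiter spaces leak into the text.
naive = [t for t in re.split(r"\\[a-z]+-?\d*", fragment) if t]

# Aware: also consume the single optional space that ends a control word.
aware = [t for t in re.split(r"\\[a-z]+-?\d* ?", fragment) if t]

print(naive)  # [' Hello,', ' ', ' World', ' !']
print(aware)  # ['Hello,', 'World', '!']
```

The naive split even reports a run consisting of a lone space between \b0 and \i, which is exactly the kind of phantom text that makes it hard to say where bold stops and italics start.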


javascript

python

ms-word

wordprocessingml

python-docx