Not able to read numbers in Word documents using Python?

Question

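I am reading .docx documents in Python with packages such as docx2txt, docx2python, and docx. However, I cannot read the numbers under particular sections, even though the numbers are present in the Word document.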

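[Some paragraphs before the questions]

Questions:

Question 1?
Question 2? Another question?
Question 3?

Conclusions:

Text related to question 1.
Text related to question 2.
Text related to question 3.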

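I need to determine the number of questions under the Questions section, and it should match the number of conclusions. In this case, that is 3 questions and 3 conclusions.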

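For example: [[['', 'Executive Summary', 'Context', 'LIBOR products continue to be available across our Global Businesses. We have developed an initial framework for limiting the sale of IBOR based contracts.', 'Questions this paper addresses', '1)\tWhat frameworks have our Global Businesses put in place to limit the sale of IBOR based contracts? How have they been implemented?', '2)\tWhat does the decision-making process look like? What decisions have been made to date? ', '3)\tHow has execution progressed? ', 'Conclusions', '1)\tOur Global Businesses have designed frameworks and associated assurance models to govern them.', '2)\tDecisions are approved by the respective business heads. To date, GM has withdrawn only two products.', '3)\tThe frameworks have been implemented and are in effect across all regions. The assurance model/approach has been implemented.', '', 'Input Sought', 'This paper is for noting.', 'Input Received', 'IBOR Transition Programme Lead, IBOR CRO and IBOR Business leads',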

Answer 1

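Here is the code I wrote. My algorithm only works if your docx keeps the same format (Questions: \n 1) ... \n 2) ... \n ... \n Conclusions: \n 1) ... \n 2) ... \n ...). For example, it will not work if you put the conclusions before the questions.

I tried it with the docx you provided and it works.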

from docx2python import docx2python
import re

re_bullet = re.compile(r"[0-9]+\)")  # an integer followed by a closing parenthesis, e.g. "1)"

text = docx2python('test.docx').text  # the docx to analyze


def is_conclusions_heading(line):
    """Return True if the line is the "Conclusions" heading, with or without a trailing colon."""
    return line.strip().rstrip(":") == "Conclusions"


def count_questions(text):
    """
    Count the questions contained in the text.
    Args:
        text (str): text to analyze
    Returns:
        (int): number of questions
    """
    result = 0
    for line in text.split("\n"):  # go through the text line by line
        if is_conclusions_heading(line):  # stop counting questions at the conclusions heading
            break

        if re_bullet.match(line):  # the line starts with the bullet pattern
            result += 1
    return result


def count_conclusions(text):
    """
    Count the conclusions contained in the text.
    Args:
        text (str): text to analyze
    Returns:
        (int): number of conclusions
    """
    result = 0
    start_conclusion = False  # True once we are inside the conclusions part
    for line in text.split("\n"):
        if is_conclusions_heading(line):  # start counting conclusions from here
            start_conclusion = True

        if start_conclusion and re_bullet.match(line):
            result += 1
    return result


def questions_number_equals_to_conclusions_number(text):
    """
    Check whether there are as many questions as conclusions.
    Args:
        text (str): text to analyze
    Returns:
        (bool): True if there are as many questions as conclusions, False otherwise.
    """
    return count_questions(text) == count_conclusions(text)


print(questions_number_equals_to_conclusions_number(text))

Here is the result:

True
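The ordering limitation mentioned at the top of this answer could probably be relaxed by remembering which heading was seen last and crediting each numbered line to that section. The following is only a minimal sketch of that idea, not a tested replacement: `count_items_by_section` and its `headings` parameter are hypothetical names introduced here, and it assumes each heading sits on its own line (a trailing colon is tolerated) and that numbered lines start with a pattern like "1)". It works on plain text, so it can be fed the `.text` string from docx2python.

```python
import re

# A line that starts with an integer followed by ")", e.g. "1)\tSome text".
BULLET = re.compile(r"[0-9]+\)")


def count_items_by_section(text, headings=("questions", "conclusions")):
    """Count numbered lines under each heading, whatever order the headings appear in.

    `headings` is an assumption: lower-case heading names without the colon.
    """
    counts = {heading: 0 for heading in headings}
    current = None  # heading of the section we are currently inside, if any
    for line in text.split("\n"):
        label = line.strip().rstrip(":").lower()
        if label in counts:
            current = label  # a new section starts here
        elif current is not None and BULLET.match(line):
            counts[current] += 1
    return counts


# Conclusions placed before questions, the case the original code cannot handle.
sample = "Intro\nConclusions:\n1)\ta\n2)\tb\nQuestions:\n1)\tq1\n2)\tq2\n3)\tq3\n"
print(count_items_by_section(sample))  # {'questions': 3, 'conclusions': 2}
```

For a document whose heading reads differently, such as 'Questions this paper addresses' in the example from the question, the matching lower-case text would have to be passed in `headings`. Note that numbered lines under any later, unlisted heading would still be credited to the last listed section.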

python

ms-word

spacy