Why is isspace() returning False for strings from the python-docx library that are empty?

Question

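My objective is to extract strings from numbered/bulleted lists in multiple Microsoft Word documents, then organize those strings into a single one-line string in which the strings are ordered as follows: 1.string1 2.string2 3.string3 etc. I refer to these one-line strings as procedures, consisting of 'steps' 1., 2., 3., etc.

This format is required because the procedure strings are put into a database, the database is used to create Excel spreadsheet outputs, a formatting macro is applied to the spreadsheets, and the procedure strings in question must be in this format for that macro to work properly.

The numbered/bulleted lists in MS Word are all formatted about the same, except that some use numbers, some use bullets, and some have an extra blank line before the first point or an extra blank line after the last point.

The following text shows three different examples of how the Word documents are formatted: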

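Paragraph Keyword 1: arbitrary text
1. Step 1
2. Step 2
3. Step 3
Paragraph Keyword 2: arbitrary text

Paragraph Keyword 3: arbitrary text
• Step 1
• Step 2
• Step 3

Paragraph Keyword 4: arbitrary text

Paragraph Keyword 5: arbitrary text

Step 1
Step 2
Step 3

Paragraph Keyword 6: arbitrary text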

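(For some reason the first two lists are not indented in this post's formatting, but in my Word documents all of the indentation is the same.)

My code works fine when the numbered/bulleted list has no extra blank lines, such as between "Paragraph Keyword 1:" and "Paragraph Keyword 2:".

I am trying to use isspace() to isolate the cases where there are extra blank lines, which are not part of the list and which I do not want to include in my procedure strings.

Here is my code: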

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
def extractStrings(file):
    doc = file
    for i in range(len(doc.paragraphs)):
        str1 = doc.paragraphs[i].text
        if "Paragraph Keyword 1:" in str1:
            start1=i
        if "Paragraph Keyword 2:" in str1:
            finish1=i
        if "Paragraph Keyword 3:" in str1:
            start2=i
        if "Paragraph Keyword 4:" in str1:
            finish2=i
        if "Paragraph Keyword 5:" in str1:
            start3=i
        if "Paragraph Keyword 6:" in str1:
            finish3=i
    print("----------------------------")
    procedure1 = ""
    y=1
    for x in range(start1 + 1, finish1):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure1 = (procedure1 + " " + str(y) + "." + temp)
            else:
                procedure1 = (procedure1 + str(y) + "." + temp)
            y=y+1
            print(procedure1)
    print("----------------------------")
    procedure2 = ""
    y=1
    for x in range(start2 + 1, finish2):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure2 = (procedure2 + " " + str(y) + "." + temp)
            else:
                procedure2 = (procedure2 + str(y) + "." + temp)
            y=y+1
            print(procedure2)
    print("----------------------------")
    procedure3 = ""
    y=1
    for x in range(start3 + 1, finish3):
        temp = str((doc.paragraphs[x].text))
        print(temp)
        if not temp.isspace():
            if y > 1:
                procedure3 = (procedure3 + " " + str(y) + "." + temp)
            else:
                procedure3 = (procedure3 + str(y) + "." + temp)
            y=y+1
            print(procedure3)
    print("----------------------------")
    del doc
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

import docx
doc1 = docx.Document("docx_isspace_experiment_042420.docx")
extractStrings(doc1)
del doc1

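Unfortunately I have no way to put the output into this post, but the problem is that whenever there is a blank line in the Word document, isspace() returns False and a number "x." is assigned to the blank line, so I end up with something like this: 1. 2.Step 1 3.Step 2 4.Step 3 5. 6. (this is the last iteration of print(procedure3) in the code).

The problem is that isspace() is returning False even though my Python console output shows that the string is just a blank line.

Am I using isspace() incorrectly? Is there something in the string that I am not detecting that causes isspace() to return False? Is there a better way to accomplish this?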

Answer 1

Use this test:

# --- for s a str value, like paragraph.text ---
if s.strip() == "":
    print("s is a blank line")

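str.isspace() returns True only if the string contains at least one character and every character in it is whitespace. An empty str contains no characters at all, so isspace() returns False for it. An empty paragraph in the Word document has the text "", which is why those blank lines pass the "not temp.isspace()" check and get a step number.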
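A minimal sketch of how the strip() test could be applied to the question's loop. The names is_blank, build_procedure and paragraph_texts are hypothetical and not part of python-docx; the list of strings stands in for the paragraph.text values found between two keyword paragraphs. Unlike the original loop, this version also strips leading and trailing whitespace from each step, which is an assumption about the desired output.

```python
def is_blank(text):
    """Return True for an empty string or a string made only of whitespace."""
    return text.strip() == ""


def build_procedure(lines):
    """Join the non-blank lines into the '1.step 2.step ...' form."""
    steps = [line.strip() for line in lines if not is_blank(line)]
    return " ".join(f"{n}.{step}" for n, step in enumerate(steps, start=1))


# Hypothetical stand-in for the paragraph texts between two keyword paragraphs.
paragraph_texts = ["", "Step 1", "Step 2", "Step 3", "", ""]

print("".isspace())     # False: the string has no characters at all
print("   ".isspace())  # True: at least one character, all whitespace
print(build_procedure(paragraph_texts))  # 1.Step 1 2.Step 2 3.Step 3
```

Because the numbering comes from enumerate() over the filtered list, blank paragraphs never consume a step number, and the same helper can be reused for all three keyword ranges instead of repeating the loop.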

