How to extract text from PowerPoint text boxes, in their order within the presentation, using python-pptx

Question

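My PowerPoint slides consist of text boxes, sometimes inside group shapes. When I extract data from them, the text does not come out in order. Sometimes a text box from the end of the presentation is extracted first, sometimes one from the middle, and so on.

The following code gets the text from the text boxes and also handles group shapes.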

from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

text_list = []
# `files` is a list of .pptx paths defined elsewhere
for eachfile in files:
    prs = Presentation(eachfile)
    textrun = []
    for slide in prs.slides:
        # ---Only on text-boxes outside group elements---
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
                textrun.append(shape.text)

        # ---Only operate on group shapes---
        group_shapes = [shp for shp in slide.shapes
                        if shp.shape_type == MSO_SHAPE_TYPE.GROUP]
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    print(shape.text)
                    textrun.append(shape.text)
    # one space-joined string per file
    new_list = " ".join(textrun)
    text_list.append(new_list)

print(text_list)

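I want to filter some of the extracted data based on the order in which it appears on the slides. On what basis does the function decide the order? What should I do to solve this?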

Answer 1

Steve's comment is right; the shapes returned from:

for shape in slide.shapes:
    ...

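are in the document order of the underlying XML, which is also what establishes z-order. Z-order is the "stacking" order, as though each shape were on its own transparent sheet (layer): the first shape returned is at the bottom, and each subsequent shape is added to the top of the stack (overlapping anything beneath it).

I think what you're after here is something like left-to-right, top-to-bottom order. You'll need to write your own code to sort the shapes into that order, using shape.left and shape.top.

Something like this might do the trick: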

from pptx.enum.shapes import MSO_SHAPE_TYPE

def iter_textframed_shapes(shapes):
    """Generate shape objects in *shapes* that can contain text.

    Shape objects are generated in document order (z-order), bottom to top.
    """
    for shape in shapes:
        # ---recurse on group shapes---
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            group_shape = shape
            for shape in iter_textframed_shapes(group_shape.shapes):
                yield shape
            continue

        # ---otherwise, treat shape as a "leaf" shape---
        if shape.has_text_frame:
            yield shape

# `slide` is one slide from prs.slides
textable_shapes = list(iter_textframed_shapes(slide.shapes))
ordered_textable_shapes = sorted(
    textable_shapes, key=lambda shape: (shape.top, shape.left)
)

for shape in ordered_textable_shapes:
    print(shape.text)
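One possible refinement, offered only as a sketch: a strict (top, left) sort can misorder two boxes that look like they are on the same row but whose top values differ by a small amount. The example below groups shapes into rows using a tolerance and then sorts each row left to right. The function name `sort_reading_order` and the `row_tolerance` parameter are hypothetical, not part of python-pptx. The stand-in objects only mimic the `top`, `left`, and `text` attributes of real shapes so the example runs without a .pptx file. It assumes every shape has a non-None `top` and `left`, and it does not account for shapes inside a group possibly reporting positions relative to the group rather than the slide.

```python
from types import SimpleNamespace

EMU_PER_INCH = 914400  # python-pptx reports positions in English Metric Units


def sort_reading_order(shapes, row_tolerance=EMU_PER_INCH // 4):
    """Return *shapes* sorted top-to-bottom, then left-to-right within a row.

    A shape joins the current row when its `top` is within *row_tolerance*
    of the first (topmost) shape in that row; otherwise it starts a new row.
    `row_tolerance` is an illustrative default, not a python-pptx setting.
    """
    by_top = sorted(shapes, key=lambda s: s.top)
    rows = []
    for shape in by_top:
        # start a new row when this shape sits clearly below the current row
        if not rows or shape.top - rows[-1][0].top > row_tolerance:
            rows.append([shape])
        else:
            rows[-1].append(shape)
    # flatten the rows, ordering each one by horizontal position
    return [s for row in rows for s in sorted(row, key=lambda s: s.left)]


# Stand-in objects exposing the same attributes the sort relies on
boxes = [
    SimpleNamespace(text="right", top=100_000, left=5_000_000),
    SimpleNamespace(text="below", top=2_000_000, left=0),
    SimpleNamespace(text="left", top=130_000, left=500_000),
]

strict_texts = [s.text for s in sorted(boxes, key=lambda s: (s.top, s.left))]
ordered_texts = [s.text for s in sort_reading_order(boxes)]
print(strict_texts)
print(ordered_texts)
```

With real slides, the list produced by `iter_textframed_shapes(slide.shapes)` could be passed to `sort_reading_order` in place of `boxes`. A suitable tolerance depends on the layout of the deck, so it would need tuning.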

