How to access OpenXML content by page number?

Question

Using OpenXML, can I read document content by page number?

wordDocument.MainDocumentPart.Document.Body gives the content of the complete document.

    using System;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            int pageCount = 0;
            // The extended properties part is optional, so guard each step against null.
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }

MSDN Reference

Update 1:

A page break is stored as follows:

<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
    <w:r>
        <w:br w:type="page" />
    </w:r>
</w:p>

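So now I need to split the XML on the check above and take the InnerText of each piece, which will give me the text page by page.

The question now is how to split the XML on that check.

Update 2:

The page break element is only present when you insert an explicit page break. When text simply flows from one page onto the next, no page break XML element is set, so it comes back to the same challenge: how to identify where the pages break.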

Answer 1

This is how I ended up doing it.

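    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Text collected per page, keyed by 1-based page number.
        Dictionary<int, string> pagewiseContent = new Dictionary<int, string>();
        int pageCount = 0;
        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            // The extended properties part is optional, so guard each step against null.
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            int i = 1;
            StringBuilder pageContentBuilder = new StringBuilder();
            foreach (var element in body.ChildElements)
            {
                // Look for <w:br w:type="page" /> as an element instead of matching the
                // serialized XML string, which depends on exact whitespace.
                bool hasPageBreak = element.Descendants<Break>()
                    .Any(b => b.Type?.Value == BreakValues.Page);
                if (!hasPageBreak)
                {
                    pageContentBuilder.Append(element.InnerText);
                }
                else
                {
                    // Close the current page. As before, text that shares a
                    // paragraph with the break itself is not captured.
                    pagewiseContent.Add(i, pageContentBuilder.ToString());
                    i++;
                    pageContentBuilder = new StringBuilder();
                }
            }
            // Flush whatever follows the last page break.
            if (pageContentBuilder.Length > 0)
            {
                pagewiseContent.Add(i, pageContentBuilder.ToString());
            }
        }
    }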

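Drawback: this does not work in all cases. It only works when the document contains explicit page breaks. If text simply runs over from page 1 onto page 2, there is no marker to tell you that you are now on page 2.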

Answer 2

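You cannot reference OOXML content by page number at the OOXML data level alone.

Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. These are computed by line-breaking and pagination algorithms that are implementation dependent; they are not intrinsic to the OOXML data. There is nothing to count.

What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either, because:

By definition, the w:lastRenderedPageBreak position is stale once the content has been changed since it was last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances.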
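To make the two kinds of marker concrete, here is a minimal sketch (not part of the original answer) that splits the body text at hard page breaks and at w:lastRenderedPageBreak elements using the Open XML SDK. The class and method names are hypothetical, and the rule that skips a soft marker directly after a hard break is an assumption about how Word usually writes the two together. The result only reflects the last pagination stored in the file, so it inherits every limitation listed above.

```csharp
using System.Collections.Generic;
using System.Text;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class PageMarkerSplitter
{
    // Hypothetical helper: one string per page, as implied by the markers stored in the file.
    public static List<string> SplitOnStoredMarkers(string filepath)
    {
        var pages = new List<string>();
        var current = new StringBuilder();
        bool afterHardBreak = false;

        using (WordprocessingDocument doc = WordprocessingDocument.Open(filepath, false))
        {
            Body body = doc.MainDocumentPart.Document.Body;

            // Walk all descendants in document order so that a marker in the
            // middle of a paragraph splits that paragraph's text at the marker.
            foreach (var node in body.Descendants())
            {
                if (node is Break br && br.Type?.Value == BreakValues.Page)
                {
                    // Hard page break: always starts a new page.
                    pages.Add(current.ToString());
                    current.Clear();
                    afterHardBreak = true;
                }
                else if (node is LastRenderedPageBreak)
                {
                    // Assumption: a soft marker that directly follows a hard break
                    // records the same page boundary, so it is counted only once.
                    if (!afterHardBreak)
                    {
                        pages.Add(current.ToString());
                        current.Clear();
                    }
                    afterHardBreak = false;
                }
                else if (node is Text text)
                {
                    current.Append(text.Text);
                    afterHardBreak = false;
                }
            }
        }

        // Whatever follows the last marker is the final page.
        pages.Add(current.ToString());
        return pages;
    }
}
```

A file that was generated programmatically and never opened in Word will typically contain no w:lastRenderedPageBreak elements at all, in which case this only splits on hard breaks.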

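If you are willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numbers, page counts, and so on.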
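As a rough illustration of the Word Automation route (not from the original answer), the sketch below asks Word itself which page each paragraph starts on. It assumes Word is installed on the machine and that the project references Microsoft.Office.Interop.Word; the class and method names are hypothetical. A paragraph that spans two pages is attributed entirely to the page on which it starts, which is a simplification.

```csharp
using System.Collections.Generic;
using System.Text;
using Word = Microsoft.Office.Interop.Word;

public static class WordAutomationPages
{
    // Hypothetical helper: groups paragraph text by the page on which each paragraph starts.
    public static SortedDictionary<int, StringBuilder> ReadTextByPage(string filepath)
    {
        var pages = new SortedDictionary<int, StringBuilder>();
        var app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(filepath, ReadOnly: true, Visible: false);
            foreach (Word.Paragraph paragraph in doc.Paragraphs)
            {
                Word.Range range = paragraph.Range;
                // Query a collapsed range at the paragraph start so that the
                // reported page is the one where the paragraph begins.
                Word.Range start = doc.Range(range.Start, range.Start);
                int page = (int)start.Information[Word.WdInformation.wdActiveEndPageNumber];

                if (!pages.TryGetValue(page, out StringBuilder builder))
                {
                    builder = new StringBuilder();
                    pages[page] = builder;
                }
                builder.Append(range.Text);
            }
        }
        finally
        {
            // Cast to the underlying interfaces to avoid the method/event name ambiguity.
            if (doc != null)
            {
                ((Word._Document)doc).Close(SaveChanges: false);
            }
            ((Word._Application)app).Quit();
        }
        return pages;
    }
}
```

Because Word performs the pagination here, the page numbers depend on the installed Word version, fonts, and printer settings on that machine.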

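Otherwise, the only real answer is to move beyond page-based referencing frameworks that depend on proprietary, implementation-specific pagination algorithms.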

Answer 3

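    List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

    // The generic type arguments were lost from this post; LastRenderedPageBreak is an assumed reconstruction.
    List<Paragraph> PageParagraphs = Allparagraphs.Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1).ToList();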

Answer 4

Rename the .docx file to .zip and open the docProps\app.xml file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <Template>Normal</Template>
  <TotalTime>0</TotalTime>
  <Pages>1</Pages>
  <Words>141</Words>
  <Characters>809</Characters>
  <Application>Microsoft Office Word</Application>
  <DocSecurity>0</DocSecurity>
  <Lines>6</Lines>
  <Paragraphs>1</Paragraphs>
  <ScaleCrop>false</ScaleCrop>
  <HeadingPairs>
    <vt:vector size="2" baseType="variant">
      <vt:variant>
        <vt:lpstr>Название</vt:lpstr>
      </vt:variant>
      <vt:variant>
        <vt:i4>1</vt:i4>
      </vt:variant>
    </vt:vector>
  </HeadingPairs>
  <TitlesOfParts>
    <vt:vector size="1" baseType="lpstr">
      <vt:lpstr/>
    </vt:vector>
  </TitlesOfParts>
  <Company/>
  <LinksUpToDate>false</LinksUpToDate>
  <CharactersWithSpaces>949</CharactersWithSpaces>
  <SharedDoc>false</SharedDoc>
  <HyperlinksChanged>false</HyperlinksChanged>
  <AppVersion>14.0000</AppVersion>
</Properties>

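The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. This property is only created by the winword application. If the Word document is changed afterwards, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate. If the Word document is created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.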
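Given that the part may be missing, a defensive read could look like the sketch below. The class and method names are hypothetical; it returns null instead of throwing when the extended properties part, the Pages element, or a parseable number is absent. Even when a value comes back, it is only the count Word stored at its last save.

```csharp
using DocumentFormat.OpenXml.Packaging;

public static class StoredPageCount
{
    // Hypothetical helper: returns null when the file carries no usable <Pages> value.
    public static int? TryRead(string filepath)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filepath, false))
        {
            // Each step may be null for programmatically created documents.
            string text = doc.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;

            // int.TryParse returns false for null or non-numeric text.
            return int.TryParse(text, out int pages) ? pages : (int?)null;
        }
    }
}
```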

Answer 5

Unfortunately, as the other answers say, docx does not contain a reliable page number service. The XML files carry no page numbers until Microsoft Word opens the document and renders it dynamically. Reading the OpenXML documentation, for example https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 , does not change this.

You can unzip some docx files and search for "page" or "pg", and then you will see for yourself. In my case I did this with different kinds of docx files, and they all told me the same thing. I hope this helps.
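If you would rather not unzip and search by hand, the same survey can be sketched with the base class library alone. The class and method names below are hypothetical. It counts element names in word/document.xml that contain "page" or start with "pg"; note that a hard page break is stored as an attribute value (w:type="page" on w:br), so it will not show up in an element-name search like this one.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class PageElementSurvey
{
    // Hypothetical helper: prints page-related element names found in the main document part.
    public static void Print(string filepath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(filepath))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                Console.WriteLine("word/document.xml not found in the package.");
                return;
            }

            using (Stream stream = entry.Open())
            {
                XDocument xml = XDocument.Load(stream);

                // Group matching element names and count how often each occurs.
                var counts = xml.Descendants()
                    .Select(e => e.Name.LocalName)
                    .Where(n => n.IndexOf("page", StringComparison.OrdinalIgnoreCase) >= 0
                             || n.StartsWith("pg", StringComparison.Ordinal))
                    .GroupBy(n => n)
                    .OrderBy(g => g.Key, StringComparer.Ordinal);

                foreach (var group in counts)
                {
                    Console.WriteLine($"{group.Key}: {group.Count()}");
                }
            }
        }
    }
}
```

On .NET Framework this needs references to System.IO.Compression and System.IO.Compression.FileSystem; on current .NET no extra reference is required.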

c#

xml

docx

openxml

openxml-sdk