How to handle smartTag nodes using OpenXML

I have a C# application that uses OpenXML to read text from a Word (.docx) file.

In general, the document contains a set of paragraphs (p), each of which contains run elements (r). I can iterate over the run nodes using
foreach ( var run in para.Descendants<Run>() )
{
  ...
}

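In one particular document there is the text "START", which is split into three parts: "ST", "AR" and "T". Each of them is defined by a run node, but in two cases the run node is wrapped in a "smartTag" node.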

<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
    <w:r w:rsidRPr="00BF444F">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
            <w:b/>
            <w:bCs/>
            <w:sz w:val="40"/>
            <w:szCs w:val="40"/>
        </w:rPr>
        <w:t>ST</w:t>
    </w:r>
</w:smartTag>
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="PersonName">
    <w:r w:rsidRPr="00BF444F">
        <w:rPr>
            <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
            <w:b/>
            <w:bCs/>
            <w:sz w:val="40"/>
            <w:szCs w:val="40"/>
        </w:rPr>
        <w:t>AR</w:t>
    </w:r>
</w:smartTag>
<w:r w:rsidRPr="00BF444F">
    <w:rPr>
        <w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/>
        <w:b/>
        <w:bCs/>
        <w:sz w:val="40"/>
        <w:szCs w:val="40"/>
    </w:rPr>
    <w:t xml:space="preserve">T</w:t>
</w:r>

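As far as I can tell, OpenXML does not support smartTag nodes. As a result, it only generates OpenXmlUnknownElement nodes for them.

The difficulty is that it also generates OpenXmlUnknownElement nodes for all descendants of the smartTag. This means I cannot simply take the first child node and cast it to a Run.

Getting the text (through the InnerText property) is easy, but I also need the formatting information.

Is there any reasonably simple way to handle this?

At the moment, my best idea is to write a preprocessor that removes the smartTag nodes.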


Edit

Following up on Cindy Meister's comment.

I am using OpenXml version 2.7.2. As Cindy pointed out, OpenXML 2.0 has a class SmartTagRun. I did not know about that class.

I found the following information on the page What's new in the Open XML SDK 2.5 for Office

Smart tags

Because smart tags were deprecated in Office 2010, the Open XML SDK 2.5 doesn't support smart tag related Open XML elements. The Open XML SDK 2.5 still can process smart tag elements as unknown elements, however the Open XML SDK 2.5 Productivity Tool for Office validates those elements (see the following list) in Office document files as invalid tags.

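So it sounds like a possible solution would be to use OpenXML 2.0.

The solution is to remove the w:smartTag elements using Linq to XML (or the System.Xml classes, if you prefer those), as shown in the following code: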

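using System.IO;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using Xunit;

public class SmartTagTests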
{
    private const string Xml =
        @"<w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
<w:body>
    <w:p>
        <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t>ST</w:t>
            </w:r>
        </w:smartTag>
        <w:smartTag w:uri=""urn:schemas-microsoft-com:office:smarttags"" w:element=""PersonName"">
            <w:r w:rsidRPr=""00BF444F"">
                <w:rPr>
                    <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                    <w:b/>
                    <w:bCs/>
                    <w:sz w:val=""40""/>
                    <w:szCs w:val=""40""/>
                </w:rPr>
                <w:t>AR</w:t>
            </w:r>
        </w:smartTag>
        <w:r w:rsidRPr=""00BF444F"">
            <w:rPr>
                <w:rFonts w:ascii=""Arial"" w:hAnsi=""Arial"" w:cs=""Arial""/>
                <w:b/>
                <w:bCs/>
                <w:sz w:val=""40""/>
                <w:szCs w:val=""40""/>
            </w:rPr>
            <w:t xml:space=""preserve"">T</w:t>
        </w:r>
    </w:p>
</w:body>
</w:document>";

    [Fact]
    public void CanStripSmartTags()
    {
        // Say you have a WordprocessingDocument stored on a stream (e.g., read
        // from a file).
        using Stream stream = CreateTestWordprocessingDocument();

        // Open the WordprocessingDocument and inspect it using the strongly-
        // typed classes. This shows that OpenXmlUnknownElement instances are
        // found and only a single Run instance is recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;

            Assert.Single(document.Descendants<Run>());
            Assert.NotEmpty(document.Descendants<OpenXmlUnknownElement>());
        }

        // Now, open that WordprocessingDocument to make edits, using Linq to XML.
        // Do NOT use the strongly typed classes in this context.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, true))
        {
            // Get the w:document as an XElement and demonstrate that this
            // w:document contains w:smartTag elements.
            MainDocumentPart part = wordDocument.MainDocumentPart;
            string xml = ReadString(part);
            XElement document = XElement.Parse(xml);

            Assert.NotEmpty(document.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Transform the w:document, stripping all w:smartTag elements and
            // demonstrate that the transformed w:document no longer contains
            // w:smartTag elements.
            var transformedDocument = (XElement) StripSmartTags(document);

            Assert.Empty(transformedDocument.Descendants().Where(d => d.Name.LocalName == "smartTag"));

            // Write the transformed document back to the part.
            WriteString(part, transformedDocument.ToString(SaveOptions.DisableFormatting));
        }

        // Open the WordprocessingDocument again and inspect it using the 
        // strongly-typed classes. This demonstrates that all Run instances
        // are now recognized.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(stream, false))
        {
            MainDocumentPart part = wordDocument.MainDocumentPart;
            Document document = part.Document;

            Assert.Equal(3, document.Descendants<Run>().Count());
            Assert.Empty(document.Descendants<OpenXmlUnknownElement>());
        }
    }

    /// <summary>
    /// Recursive, pure functional transform that removes all w:smartTag elements.
    /// </summary>
    /// <param name="node">The <see cref="XNode" /> to be transformed.</param>
    /// <returns>The transformed <see cref="XNode" />.</returns>
    private static object StripSmartTags(XNode node)
    {
        // We only consider elements (not text nodes, for example).
        if (!(node is XElement element))
        {
            return node;
        }

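        // Strip w:smartTag elements by returning only their children. The
        // children are transformed recursively so that a w:smartTag nested
        // inside another w:smartTag is removed as well.
        if (element.Name.LocalName == "smartTag")
        {
            return element.Elements().Select(e => StripSmartTags(e));
        }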

        // Perform the identity transform.
        return new XElement(element.Name, element.Attributes(),
            element.Nodes().Select(StripSmartTags));
    }

    private static Stream CreateTestWordprocessingDocument()
    {
        var stream = new MemoryStream();

        using var wordDocument = WordprocessingDocument.Create(stream, WordprocessingDocumentType.Document);
        MainDocumentPart part = wordDocument.AddMainDocumentPart();
        WriteString(part, Xml);

        return stream;
    }

    #region Generic Open XML Utilities

    private static string ReadString(OpenXmlPart part)
    {
        using Stream stream = part.GetStream(FileMode.Open, FileAccess.Read);
        using var streamReader = new StreamReader(stream);
        return streamReader.ReadToEnd();
    }

    private static void WriteString(OpenXmlPart part, string text)
    {
        using Stream stream = part.GetStream(FileMode.Create, FileAccess.Write);
        using var streamWriter = new StreamWriter(stream);
        streamWriter.Write(text);
    }

    #endregion
}
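
If you would rather not rebuild the whole tree, the same idea can be sketched as an in-place edit that uses only System.Xml.Linq. This is a minimal sketch, not part of the Open XML SDK: the class name SmartTagUnwrapper and its two methods are hypothetical, and the choice to drop w:smartTagPr children is an assumption about what you want to keep. It also shows that the formatting the question asks about (bold, font size) can be read straight from the w:rPr element of each run.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class SmartTagUnwrapper
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces every w:smartTag under root with its child elements, in place.
    public static void Unwrap(XElement root)
    {
        XElement tag;

        // Re-query after each replacement: ReplaceWith inserts copies of the
        // children, so a nested w:smartTag is found on a later pass.
        while ((tag = root.Descendants(W + "smartTag").FirstOrDefault()) != null)
        {
            // Assumption: w:smartTagPr (smart tag properties) is not wanted.
            List<XElement> children = tag.Elements()
                .Where(e => e.Name != W + "smartTagPr")
                .ToList();

            tag.ReplaceWith(children);
        }
    }

    // Returns the text, bold flag and font size in points of each w:r.
    public static List<(string Text, bool Bold, double? SizePt)> DescribeRuns(XElement root)
    {
        return root.Descendants(W + "r").Select(run =>
        {
            XElement rPr = run.Element(W + "rPr");
            string text = string.Concat(run.Elements(W + "t").Select(t => t.Value));

            // Simplification: any w:b counts as bold; w:val="false" is ignored.
            bool bold = rPr?.Element(W + "b") != null;

            // w:sz is expressed in half-points.
            string sz = (string)rPr?.Element(W + "sz")?.Attribute(W + "val");
            double? sizePt = sz == null ? (double?)null : int.Parse(sz) / 2.0;

            return (text, bold, sizePt);
        }).ToList();
    }
}
```

You could call SmartTagUnwrapper.Unwrap on the XElement parsed from the main document part and then write it back with WriteString, exactly as in the test above. Applied to the sample paragraph from the question, DescribeRuns would report three runs ("ST", "AR", "T"), each bold at 20 pt, because w:sz="40" is in half-points.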

You can also use PowerTools for Open XML, which provides a markup simplifier that directly supports removing w:smartTag elements.
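
A minimal sketch of that approach, assuming the OpenXmlPowerTools NuGet package is referenced and is compatible with the Open XML SDK version you use. The wrapper class SmartTagSimplifier and its method are hypothetical names; MarkupSimplifier, SimplifyMarkupSettings and the RemoveSmartTags setting come from PowerTools.

```csharp
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

// Hypothetical wrapper, for illustration only.
public static class SmartTagSimplifier
{
    // Removes w:smartTag elements from a document opened for editing.
    public static void RemoveSmartTags(WordprocessingDocument wordDocument)
    {
        // Only smart tag removal is switched on; all other simplifications
        // keep their default (off) setting.
        var settings = new SimplifyMarkupSettings
        {
            RemoveSmartTags = true
        };

        MarkupSimplifier.SimplifyMarkup(wordDocument, settings);
    }
}
```

The document has to be opened with isEditable set to true, for example WordprocessingDocument.Open(stream, true), and closed before you reopen it to read the runs through the strongly typed classes.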