How to write a regex to find HTML tags outside CDATA tags in an XML document

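I am trying to import an ONIX (XML) file that fails with import errors because of HTML tags in the descriptive text. In this particular file, some of the descriptive text is wrapped in CDATA tags, but some apparently is not.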

How can I write a regex to find HTML tags that are not enclosed in CDATA tags?

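I am using a VB.NET application to import the data into a SQL Server database, but for now I am trying to write the regex in Notepad++ to see what is possible. I can incorporate the regex into the VB code later.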

Here is a sample of XML that imports correctly:

<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text><![CDATA[More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.]]></Text>
</OtherText>

Here is XML that does not import correctly:

<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text>More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.</Text>
</OtherText>

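Now,

<TextFormat>02</TextFormat>

indicates that the content of the tag is HTML, so I can handle that. The problem comes when my tags are not marked up properly. I need to find those so that I can correct them.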

This regex may get you somewhere:

<\w+>(?!<!\[CDATA\[)

I ran it in Sublime Text against the examples you provided, and it only matches tags that are not immediately followed by the CDATA opener.
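Note that a lookahead like this only checks what comes right after each opening tag, so it will also hit tags such as <em> that sit inside a CDATA section. If a quick script is an option, a different approach is to blank out the CDATA sections first and then look for anything tag-like left inside each <Text> element. Below is a minimal Python sketch of that idea, not a definitive implementation. The function name find_unwrapped_tags is hypothetical, the <Text> element name is taken from your sample, and it assumes <Text> carries no attributes and that the literal string </Text> never appears inside a CDATA section.

```python
import re

# Matches a whole CDATA section, including content that spans several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)
# Matches one <Text> element and captures its body (element name assumed from the sample).
TEXT_RE = re.compile(r"<Text>(.*?)</Text>", re.DOTALL)
# Matches anything that looks like an opening, closing, or self-closing tag.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")


def find_unwrapped_tags(xml):
    """Return (line number, tags) for each <Text> body with markup outside CDATA."""
    results = []
    for match in TEXT_RE.finditer(xml):
        # Remove CDATA sections so tags inside them are ignored.
        body = CDATA_RE.sub("", match.group(1))
        tags = TAG_RE.findall(body)
        if tags:
            # 1-based line number of the opening <Text> tag.
            line = xml.count("\n", 0, match.start()) + 1
            results.append((line, tags))
    return results


sample = """<OtherText>
  <Text><![CDATA[Good <em>text</em><br/>]]></Text>
</OtherText>
<OtherText>
  <Text>Bad <em>text</em><br/></Text>
</OtherText>"""

for line, tags in find_unwrapped_tags(sample):
    print(line, tags)  # prints: 5 ['<em>', '</em>', '<br/>']
```

The reported line numbers should make it easy to jump to each offending <Text> element in Notepad++ and wrap its content in CDATA by hand. The same three patterns should carry over to .NET's Regex class with RegexOptions.Singleline in place of re.DOTALL, though I have not tested that.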