How to work around Groovy's XmlSlurper refusing to parse HTML due to DOCTYPE and DTD restrictions?

Question

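I'm trying to duplicate an element in an HTML coverage report so that the coverage totals appear at both the top and the bottom of the report.

The HTML starts like this, and I believe it is well-formed: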

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
    <link rel="stylesheet" href=".resources/report.css" type="text/css" />
    <link rel="shortcut icon" href=".resources/report.gif" type="image/gif" />
    <title>Unified coverage</title>
    <script type="text/javascript" src=".resources/sort.js"></script>
  </head>
  <body onload="initialSort(['breadcrumb', 'coveragetable'])">

Groovy's XmlSlurper complains as follows:

doc = new XmlSlurper( /* false, false, false */ ).parse("index.html")
[Fatal Error] index.html:1:48: DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.

Allowing the DOCTYPE declaration (the third constructor argument):

doc = new XmlSlurper(false, false, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(false, true, true).parse("index.html")
[Fatal Error] index.html:1:148: External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.


doc = new XmlSlurper(true, true, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

doc = new XmlSlurper(true, false, true).parse("index.html")
External DTD: Failed to read external DTD 'xhtml1-strict.dtd', because 'http' access is not allowed due to restriction set by the accessExternalDTD property.

So I think I've covered all the options. There must be a way to do this without resorting to regular expressions and risking the wrath of Tony the Pony.

Answer 1

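Tsk.

def parser = new XmlSlurper()
// Allow the <!DOCTYPE ...> declaration to appear in the input.
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
// Do not fetch the external DTD that the DOCTYPE points to.
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def doc = parser.parse("index.html")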
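
To try the two feature flags without the real report, here is a minimal sketch that parses an inline XHTML string carrying the same DOCTYPE. It assumes the JDK's built-in (Xerces-based) parser, which is the one that recognizes these feature URIs, and Groovy 3 or later for the import. The `<p id="totals">` element is a made-up stand-in for the report body. Because the DTD is never loaded, named entities such as `&nbsp;` would not resolve; the sample avoids them.

```groovy
// Groovy 3+; on Groovy 2.x XmlSlurper lives in groovy.util and needs no import.
import groovy.xml.XmlSlurper

// The XML declaration must be the very first characters of the string.
def xhtml = '''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head><title>Unified coverage</title></head>
  <body><p id="totals">hypothetical totals</p></body>
</html>'''

def parser = new XmlSlurper()
// Accept the DOCTYPE declaration instead of rejecting it outright.
parser.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
// Skip downloading xhtml1-strict.dtd, so no 'http' access is attempted.
parser.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)

def doc = parser.parseText(xhtml)
def title = doc.head.title.text()
println title
```

Setting only the first feature reproduces the "Failed to read external DTD" error from the question, which is why both are needed.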

Answer 2

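Even if your HTML happens to be well-formed XML, the more general solution for parsing HTML is to use a real HTML parser. I have used the TagSoup parser in the past, and it handles real-world HTML well.

TagSoup provides a parser that implements the org.xml.sax.XMLReader interface, which can be passed to the XmlSlurper constructor. Example: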

@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')

import org.ccil.cowan.tagsoup.Parser

def doc = new XmlSlurper(new Parser()).parse("index.html")
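
As a rough illustration of what "real-world HTML" means here, the sketch below feeds TagSoup a fragment that is not well-formed XML at all. The fragment and the figure in it are invented for the example, and the import again assumes Groovy 3 or later. TagSoup is expected to supply the missing `html` and `body` wrappers and close the open tags, so the usual GPath navigation should still apply.

```groovy
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
// Groovy 3+; on Groovy 2.x XmlSlurper lives in groovy.util and needs no import.
import groovy.xml.XmlSlurper

// Hypothetical fragment: no <html>/<body> wrapper, unclosed <p> and <b> tags.
def messy = '<p>Total coverage: <b>87%'

// TagSoup's Parser is an XMLReader, so XmlSlurper can delegate parsing to it.
def doc = new XmlSlurper(new Parser()).parseText(messy)
def figure = doc.body.p.b.text()
println figure
```

A stock XML parser would reject this input outright, so this route is worth considering when the report generator cannot be trusted to emit strict XHTML.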

