Removing sections of a large Java String (it contains HTML source)

Question

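In my application, I load the source of an HTML page into a String. Within this HTML, I want to remove certain content that sits between specific HTML comments.

For example: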

//the entire String will be HTML source like this, of the entire page
<div id="someid">
    <a href="#">Some text</a>
    <!-- this_tag_start 123 -->
    <p> This text between the tags to be removed </p>
    <!-- this_tag_end 123 -->
    <a href="#">Some text</a>
</div>

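The this_tag_start 123 comment and its matching "end" comment are generated by our server. The number 123 will vary.

In my program, I have a String containing the entire HTML source. I want to remove the text between these two comment tags (whether the comment tags themselves are kept does not matter). These HTML comment tags can appear multiple times throughout the HTML source.

Right now I am removing the content with this regular expression: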

htmlString = htmlString.replaceAll(
    "<!-- this_tag_start(.*?)<!-- this_tag_end[\\s\\d]+-->", ""
    );

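This works and correctly removes the comment tags along with the content between the start and end tags. However, it doesn't feel like it's an elegant solution. There should be a better/faster way to do it, right?

If it matters, the String is produced by WebDriver's getPageSource() method.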

Answer 1

1. Elegance

However, it doesn't feel like it's an elegant solution.

Here are two variants of the original regex:

Variant 1

(?s)\s*<!-- this_tag_start([\s\d]+)-->.+?<!-- this_tag_end\1-->\s*


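This variant uses a backreference to the id. The one drawback I see is that it also accepts an id consisting only of whitespace. As long as you control the comments, that is not a problem.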
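To make the whitespace-only drawback concrete, here is a minimal sketch. It assumes the end comment repeats the id through a `\1` backreference, as the description above says. The class name `Variant1Demo`, the helper `strip`, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

public class Variant1Demo {
    // Group 1 captures the id (any mix of whitespace and digits);
    // \1 requires the end comment to repeat exactly the same text.
    static final Pattern VARIANT_1 = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start([\\s\\d]+)-->.+?<!-- this_tag_end\\1-->\\s*");

    // Hypothetical helper: removes every matching comment-delimited block.
    static String strip(String html) {
        return VARIANT_1.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String matching   = "<a>1</a><!-- this_tag_start 123 --><p>x</p><!-- this_tag_end 123 --><a>2</a>";
        String mismatched = "<a>1</a><!-- this_tag_start 123 --><p>x</p><!-- this_tag_end 456 --><a>2</a>";
        String blankId    = "<a>1</a><!-- this_tag_start --><p>x</p><!-- this_tag_end --><a>2</a>";

        System.out.println(strip(matching));   // ids agree: the block is removed
        System.out.println(strip(mismatched)); // ids differ: the input is left unchanged
        System.out.println(strip(blankId));    // id is a single space: the block is still removed
    }
}
```

The last case is the drawback: a single space satisfies `[\s\d]+`, so a pair of comments with no number at all is treated as a valid block.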

Variant 2

(?s)\s*<!-- this_tag_start\s+(\d+)\s*-->.+?<!-- this_tag_end\s+\1\s*-->\s*


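This variant again uses a backreference to the id. However, it is more explicit about the expected form of the id: one or more whitespace characters, then one or more digits, followed by zero or more whitespace characters.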
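As a rough check of this stricter form, the sketch below runs it against the sample markup from the question. It assumes the end comment carries the `\1` backreference described above. `Variant2Demo` and `strip` are illustrative names, not part of any library.

```java
import java.util.regex.Pattern;

public class Variant2Demo {
    // Group 1 captures only the digits; the whitespace around the id is
    // matched separately, so an id without digits cannot match.
    static final Pattern VARIANT_2 = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start\\s+(\\d+)\\s*-->.+?<!-- this_tag_end\\s+\\1\\s*-->\\s*");

    // Hypothetical helper: removes every matching comment-delimited block.
    static String strip(String html) {
        return VARIANT_2.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html =
              "<div id=\"someid\">\n"
            + "    <a href=\"#\">Some text</a>\n"
            + "    <!-- this_tag_start 123 -->\n"
            + "    <p> This text between the tags to be removed </p>\n"
            + "    <!-- this_tag_end 123 -->\n"
            + "    <a href=\"#\">Some text</a>\n"
            + "</div>";

        // The comments, the paragraph between them, and the whitespace
        // on both sides of the block are all removed.
        System.out.println(strip(html));
    }
}
```

Because of the leading and trailing `\s*`, the line breaks and indentation around the block disappear too, so the two links end up directly next to each other. Drop those two `\s*` parts if the surrounding layout should be kept.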

2. Speed

There should be a better/faster way to do it, right?

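Internally, the String#replaceAll method calls Pattern#compile. Pattern compilation is known to be slow.

I would cache the compiled result to speed up the replacement. Here is how: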

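import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyCrawler {
   // Compile once, run multiple times.
   // Note: a Matcher is not thread-safe, so this shared instance assumes single-threaded use.
   private static final Matcher COMMENT_REMOVER = Pattern.compile("the regex here...").matcher("");

   public void doCrawl() {
      String htmlString = loadHtmlSource();

      // Reuse the cached matcher on the new input instead of recompiling the pattern
      htmlString = COMMENT_REMOVER.reset(htmlString).replaceAll("");
   }

   ...
}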
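If the crawler might run on several threads, a possible alternative is to cache the Pattern rather than the Matcher. A Pattern is immutable and safe to share, while a Matcher holds per-search state. The sketch below is one way to do that, using variant 2 as the regex. `CachedPatternDemo` and `removeBlocks` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class CachedPatternDemo {
    // Compiled once when the class loads; Pattern instances can be shared between threads.
    static final Pattern COMMENT_BLOCK = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start\\s+(\\d+)\\s*-->.+?<!-- this_tag_end\\s+\\1\\s*-->\\s*");

    // Hypothetical helper: a fresh Matcher per call keeps the method thread-safe
    // while still skipping the compilation that String#replaceAll performs each time.
    static String removeBlocks(String html) {
        return COMMENT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a>1</a>\n<!-- this_tag_start 7 -->\n<p>x</p>\n<!-- this_tag_end 7 -->\n<a>2</a>";

        String cached = removeBlocks(html);
        // Same result as String#replaceAll, which compiles the regex on every call.
        String uncached = html.replaceAll(COMMENT_BLOCK.pattern(), "");

        System.out.println(cached.equals(uncached)); // prints "true"
    }
}
```

Creating a Matcher is cheap compared with compiling the pattern, so most of the saving from the cached version is kept.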
