Using Perl to strip everything from a string except HTML anchor links

Using Perl, how can I use a regex to take a string that contains arbitrary HTML, including one HTML link with anchor text, like this:

  <a href="http://example.com" target="_blank">Whatever Example</a>

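and have it keep only that link and get rid of everything else? This should work no matter what other attributes appear inside the <a tag, such as title= or style=. It should keep the anchor text "Whatever Example" and the closing </a>.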

If you need a simple regex solution, a naive approach might be:

my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

However, as @dan1111 mentioned, regular expressions are not the right tool for parsing HTML, for various reasons.
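
To make those reasons concrete, here is a minimal sketch that runs the naive regex from above against a small input. The sample HTML (a commented-out link and an <abbr> element) is made up for illustration and is not from the question; it was chosen only to show two ways the pattern can go wrong.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical input: one commented-out link, one <abbr> element,
# and the one link we actually want.
my $text = <<'EO_HTML';
<!-- <a href="http://old.example.com">Old link</a> -->
<p>See the <abbr title="HyperText Markup Language">HTML</abbr> spec, or
<a href="http://example.com" target="_blank">Whatever Example</a></p>
EO_HTML

# The naive regex, unchanged.
my @anchors = $text =~ m@(<a[^>]*?>.*?</a>)@gsi;

# Print each match with a label so multi-line matches stay readable.
printf "match %d: %s\n", $_ + 1, $anchors[$_] for 0 .. $#anchors;
```

With this input the regex returns two matches, and neither is the clean link. The first is the commented-out link, which a browser would never render. The second starts at <abbr, because <a[^>]*?> accepts any tag whose name begins with "a", and runs through to the next </a>. Adding \b after <a would address the second problem but not the first.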

If you need a reliable solution, look for an HTML parser module.

You can use a stream parser such as HTML::TokeParser::Simple:

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $html = <<EO_HTML;

Using Perl, how can I use a regex to take a string that has random HTML in it
with one HTML link with anchor, like this:

   <a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a>

       and it leave ONLY that and get rid of everything else? No matter what
   was inside the href attribute with the <a, like title=, or style=, or
   whatever. and it leave the anchor: "Whatever Example" and the </a>?
EO_HTML

my $parser = HTML::TokeParser::Simple->new(string => $html);

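# For each <a> start tag, print the tag as written, then the text up to
# the matching </a>. Note that get_text drops nested tags such as <i>.
while (my $tag = $parser->get_tag('a')) {
    print $tag->as_is, $parser->get_text('/a'), "</a>\n";
}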

Output:

$ ./whatever.pl
<a href="http://example.com" target="_blank">Whatever Interesting Example</a>
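
Note that the output above no longer contains the nested <i> element, since get_text returns only text. If the markup inside the link should survive as well, one possible variation is to collect tokens one at a time until the closing </a>. This is a sketch rather than part of the original answer: the sample input and the @anchors array are assumptions made for illustration, and it relies on the get_token, as_is, and is_end_tag methods of HTML::TokeParser::Simple.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical input with markup nested inside the link.
my $html = <<'EO_HTML';
<p>Some text <a href="http://example.com" target="_blank">Whatever <i>Interesting</i> Example</a> and more.</p>
EO_HTML

my $parser = HTML::TokeParser::Simple->new(string => $html);

my @anchors;
while (my $start = $parser->get_tag('a')) {
    # Begin with the opening <a ...> tag exactly as written.
    my $anchor = $start->as_is;

    # Append every following token verbatim, stopping after </a>.
    while (my $token = $parser->get_token) {
        $anchor .= $token->as_is;
        last if $token->is_end_tag('a');
    }
    push @anchors, $anchor;
}

print "$_\n" for @anchors;
```

Because each token is appended with as_is, the link comes back with its attributes and inner tags intact. An unclosed <a> would cause the inner loop to run to the end of the input, so real input may need an extra guard.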