How can I extract these table rows from HTML?

Question

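I want to find a team's score by matching the team name that was entered, and then taking the next number or word after the team name so that I can show the score.

I want this to work for both the left-hand and the right-hand team (left: FT Crystal Palace 1 - right: 1 Leicester).

The main problem is finding the score for the team that matches the input.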

#!/usr/bin/perl -wT

use strict;
use warnings;
use CGI;
use HTML::TableExtract;
use LWP::UserAgent;

my $cgi = CGI->new;
my $ua = LWP::UserAgent->new;


my $input = $cgi->param("team");
my $response ='';


if ($input) {
my $req = HTTP::Request->new(GET => "");
my $res = $ua->request($req);
my $html = $res->decoded_content;

$html =~ s/<span.*?<\/span>//gs;
$html =~ s/<script.*?<\/script>//gs;
$html =~ s/<td(?=[^>]*class="events-button button first-occur")[^>]*>//gs;

my $table = HTML::TableExtract->new();
$table->parse($html);

print "Content-type: text/html\n\n";

foreach my $row ($table->rows) {
     my $output =  join("", @$row);
     $output =~ s/\R//g;
     print $output;
     
    # Find matching team (Its a must to find team after word FT)
    if ($output =~ m/$input/) {
    
    # After that team the next is score (Gets team scores)
    my $score = $output =~ /$input\s*?(\S+)/;
    
    $response ="Found Team And Score is: $score";
    } else {
    $response ="Can't find team";
  }
}
}


print <<EOF;

<!DOCTYPE html>
<html>
<body>

<h1>Test</h1>

<form method="post">
  <label for="team">Enter Team:</label>
  <input type="text" name="team"><br><br>
  <input type="submit" value="Find team">
</form>

 <h4>$response</h4>

</body>
</html>
EOF

Answer 1

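I like to use Mojolicious for these things because everything I need is built in. With a little CSS selector magic, you can easily zero in on the data you want without any manual string processing.

Looking at the source, I see that the page has the results in a table with the class matches_new, and within that table the interesting rows have the class match. Within those rows, the interesting table cells have the classes team-a, team-b, and score-time (although the "score" might also be the game status).

Note that many of the rows don't load their data until you click on the team name. That's JavaScript at work, and Perl web-scraping libraries won't help you with that. I don't know why the Premier League data was already filled in for me (is it like that for everyone?) or whether we can rely on that always being the case.

But let's extract that data.

The basic Mojo HTML-parsing process is to make a request and then get the DOM (Document Object Model). Within that DOM, find and at locate particular parts: find gets everything that matches its CSS selector, and at gets the first match.

So, I find all of the right rows. Each row is another (smaller) DOM object that I can explore further to get the teams and the score.

find returns a Mojo::Collection (a fancy interface to an array). I call map to process each row in turn and extract the text from each interesting table cell (with trim removing the leading and trailing whitespace). all_text gets everything, including the text in child nodes.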
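To make the difference between find and at concrete, here is a minimal sketch that parses a small, made-up HTML fragment with Mojo::DOM directly. The fragment is an assumption for illustration; only the class names are taken from the page described above.

```perl
use v5.10;
use strict;
use warnings;

use Mojo::DOM;

# Made-up HTML for illustration; only the class names mirror the real page
my $html = <<'HTML';
<table class="matches_new">
  <tr class="match">
    <td class="team-a">Crystal Palace</td>
    <td class="score-time">1 - 1</td>
    <td class="team-b">Leicester City</td>
  </tr>
  <tr class="match">
    <td class="team-a">Chelsea</td>
    <td class="score-time">1 - 1</td>
    <td class="team-b">Aston Villa</td>
  </tr>
</table>
HTML

my $dom = Mojo::DOM->new( $html );

# find returns a Mojo::Collection holding every element that matches
my $rows = $dom->find( 'table.matches_new tr.match' );
say 'rows found: ', $rows->size;

# at returns only the first matching element (undef if nothing matches)
my $first_home = $dom->at( 'td.team-a' )->text;
say 'first home team: ', $first_home;
```

Because at can return undef when nothing matches, real code should check its result before calling a method on it.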

#!perl
use v5.10;

use Mojo::UserAgent;
use Mojo::Util qw(dumper trim);

my $url = 'https://int.soccerway.com/matches/2020/12/28/';

my $ua = Mojo::UserAgent->new;

my @results = $ua
    ->get( $url )
    ->result
    ->dom
    ->find( 'table.matches_new tr.match' )
    ->map( sub {
        my $row = $_;
        my @results =
            map { trim( $row->at( $_ )->all_text ) }
            qw( td.team-a td.team-b td.score-time );
        return \@results;
        } )
    ->to_array;

say dumper( @results );

Here is the output, showing only the football matches:

[
  [
    "Everton",
    "Manchester City",
    "PSTP"
  ],
  [
    "Crystal Palace",
    "Leicester City",
    "1 - 1"
  ],
  [
    "Chelsea",
    "Aston Villa",
    "1 - 1"
  ]
]
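With rows in that shape, looking up the score for an entered team, on either the left or the right side, becomes a plain data lookup. The following is only a sketch under assumptions: score_for_team is a hypothetical helper, the sample rows are copied from the output above, and the score cell is assumed to look like "1 - 1" when a score exists.

```perl
use v5.10;
use strict;
use warnings;

# Sample rows copied from the output above: [ home, away, score-or-status ]
my @matches = (
    [ 'Everton',        'Manchester City', 'PSTP'  ],
    [ 'Crystal Palace', 'Leicester City',  '1 - 1' ],
    [ 'Chelsea',        'Aston Villa',     '1 - 1' ],
);

# score_for_team is a hypothetical helper, not part of any library
sub score_for_team {
    my( $input, @rows ) = @_;
    return "Can't find team" unless defined $input && length $input;

    foreach my $row ( @rows ) {
        # Index 0 is the home (left) team, index 1 the away (right) team;
        # the comparison is a case-insensitive substring match
        my( $side ) = grep { index( lc $row->[$_], lc $input ) >= 0 } 0, 1;
        next unless defined $side;

        # The third cell may hold a status such as PSTP instead of a score
        my $status = $row->[2];
        my @goals  = $status =~ /\A\s*(\d+)\s*-\s*(\d+)\s*\z/
            or return "$row->[$side]: no score ($status)";

        return "$row->[$side] scored $goals[$side]";
    }

    return "Can't find team";
}

say score_for_team( 'Leicester', @matches );
say score_for_team( 'Everton',   @matches );
say score_for_team( 'Arsenal',   @matches );
```

Keeping the home and away names in separate fields means there is no need to guess whether the number before or after the name belongs to the team.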

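Strictly speaking, there is nothing "wrong" with using LWP or HTTP::Request, but they don't provide powerful tools for processing the data. HTML::TableExtract is fine, but the table gets very long and you don't need most of its rows. Over my career I have written many, many programs like the one you presented: fetch the source, remove as much irrelevant HTML as possible, then process what is left. It is much nicer to have all the tools in one package, with everything designed to work together.

Using selectors to pinpoint exactly what you want is much easier, much less fragile, and takes far less code. You can see this in the Mojo docs, and I wrote about it in detail, with lots of examples, in Mojo Web Clients.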
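If you do stay with the string-processing approach, one detail in the posted code is worth a look: a match assigned to a plain scalar returns only success or failure, not the captured text. The sketch below shows the difference. The sample string is modelled on the example in the question, with a made-up 2 - 1 score so the two results are distinguishable, and the variable names are chosen for illustration.

```perl
use v5.10;
use strict;
use warnings;

# Sample flattened row modelled on the question; the 2 - 1 score is made up
my $output = 'FT Crystal Palace 2 - 1 Leicester';
my $home   = 'Crystal Palace';
my $away   = 'Leicester';

# Scalar context: the match returns true or false, not the capture
my $matched = $output =~ /\Q$home\E\s*(\S+)/;

# List context (note the parentheses): the match returns the captures.
# \Q...\E quotes any regex metacharacters in the user-supplied name.
my( $home_score ) = $output =~ /\Q$home\E\s*(\S+)/;

# For the right-hand team the score comes before the name
my( $away_score ) = $output =~ /(\S+)\s+\Q$away\E/;

say "scalar context: $matched";
say "home score:     $home_score";
say "away score:     $away_score";
```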

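Also, whenever you have HTML, you can use Mojo::DOM on its own, no matter where the HTML came from. I wrote about this for Perl.com in "Extracting from HTML with Mojo::DOM". In your case, hand your decoded content to Mojo::DOM and do the same processing I have already shown: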

use LWP::UserAgent;
use Mojo::DOM;

# The user agent makes the request; the response has decoded_content
my $html = LWP::UserAgent->new->get( ... )->decoded_content;

my $dom = Mojo::DOM->new( $html );
my @results = $dom->find( ... )->...;

perl

parsing