Simple HTML DOM: spaces in the class attribute
I'm using Simple HTML DOM to get elements from a website, but when the class attribute contains spaces I get nothing.
Source HTML, from betexplorer.com:
<table id="table-type-2" class="stats-table stats-main table-2">
<tbody>
<tr class="odd glib-participant-ppjDR086" data-def-order="0">
<td class="rank col_rank no" title="">1.</td>
<td class="participant_name col_participant_name col_name"><span class="team_name_span"><a onclick="javascript:getUrlByWinType('/soccer/england/premier-league/teaminfo.php?team_id=ppjDR086');">Manchester United</a></span></td>
<td class="matches_played col_matches_played">4</td>
<td class="wins col_wins">4</td>
<td class="draws col_draws">0</td>
<td class="losses col_losses">0</td>
<td class="goals col_goals">14:0</td>
<td class="goals col_goals">12</td>
</tr>
<tr class="even glib-participant-hA1Zm19f" data-def-order="1">
<td class="rank col_rank no" title="">2.</td>
<td class="participant_name col_participant_name col_name"><span class="team_name_span"><a onclick="javascript:getUrlByWinType('/soccer/england/premier-league/teaminfo.php?team_id=hA1Zm19f');">Arsenal</a></span></td>
<td class="matches_played col_matches_played">4</td>
<td class="wins col_wins">4</td>
<td class="draws col_draws">0</td>
<td class="losses col_losses">0</td>
<td class="goals col_goals">11:3</td>
<td class="goals col_goals">12</td>
</tr>
<tr class="odd glib-participant-Wtn9Stg0" data-def-order="2">
<td class="rank col_rank no" title="">3.</td>
<td class="participant_name col_participant_name col_name"><span class="team_name_span"><a onclick="javascript:getUrlByWinType('/soccer/england/premier-league/teaminfo.php?team_id=Wtn9Stg0');">Manchester City</a></span></td>
<td class="matches_played col_matches_played">4</td>
<td class="wins col_wins">3</td>
<td class="draws col_draws">1</td>
<td class="losses col_losses">0</td>
<td class="goals col_goals">18:3</td>
<td class="goals col_goals">10</td>
</tr>
</tbody>
</table>
My PHP code, using SimpleHtmlDom:
<?php
include('../simple_html_dom.php');
function getHTML($url,$timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    return @curl_exec($ch);
}
$response=getHTML("http://www.betexplorer.com/soccer/england/premier-league/standings/?table=table&table_sub=home&ts=WOO1nDO2&dcheck=0",10);
$html = str_get_html($response);
$team = $html->find("span[class=team_name_span]/a");
$numbermatch = $html->find("td.matches_played.col_matches_played");
$wins = $html->find("td.wins.col_wins");
$draws = $html->find("td.draws.col_draws");
$losses = $html->find("td.losses.col_losses");
$goals = $html->find("td.goals.col_goals");
?>
<table border="1" width="100%">
<thead>
<tr>
<th>Team</th>
<th>MP</th>
<th>W</th>
<th>D</th>
<th>L</th>
<th>G</th>
</tr>
</thead>
<?php
foreach ($team as $match) {
    echo "<tr>".
        "<td class='first-cell'>".$match->innertext."</td> " .
        "<td class='first-cell'>".$numbermatch->innertext."</td> " .
        "<td class='first-cell'>".$wins->innertext."</td> " .
        "<td class='first-cell'>".$draws->innertext."</td> " .
        "<td class='first-cell'>".$losses->innertext."</td> " .
        "<td class='first-cell'>".$goals->innertext."</td> " .
        "</tr><br/>";
}
?>
</table>
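So I only get the first value (because its class name has no spaces), but I can't get the rest.
EDIT: I fixed a mistake in the PHP code. Please take another look.
EDIT2: This is not a duplicate; I tried that solution and it doesn't work.
EDIT3: I tried advanced_html_dom (which is supposed to fix the problem with spaces), but I still get nothing apart from the one value I was already getting.
EDIT4: In the screenshot below you can see what I want to get and what I get now: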
EDIT5
team.php
<?php
// START team.php
class Team
{
    public $name, $matches, $wins, $draws, $losses, $goals;

    public static function parseRow($row): ?self
    {
        $result = new self();
        $result->name = $result->parseMatch($row, 'span.team_name_span a');
        if (null === $result->name) {
            return null; // couldn't even match the name, probably not a team row, skip it
        }
        $result->matches = $result->parseMatch($row, 'td.col_matches_played');
        $result->wins = $result->parseMatch($row, 'td.col_wins');
        $result->draws = $result->parseMatch($row, 'td.col_draws');
        $result->losses = $result->parseMatch($row, 'td.col_losses');
        $result->goals = $result->parseMatch($row, 'td.col_goals');
        return $result;
    }

    private function parseMatch($row, $selector)
    {
        if (!empty($match = $row->find($selector, 0))) {
            return $match->innertext;
        }
        return null;
    }
}
// END team.php
?>
clas.php
<?php
include('../simple_html_dom.php');
include('../team.php');
function getHTML($url,$timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to wait while connecting
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    return @curl_exec($ch);
}
$response=getHTML("http://www.betexplorer.com/soccer/england/premier-league/standings/?table=table&table_sub=home&ts=WOO1nDO2&dcheck=0",10);
$html = str_get_html($response);
// START DOM parsing block
$teams = [];
foreach($html->find('table.stats-table tr') as $row) {
    $team = Team::parseRow($row); // load the row into a Team object if possible
    // skip this entry if it couldn't match the row
    if (null !== $team) {
        // what we're actually doing here is just the OOP equivalent of:
        // $teams[] = ['name' => $row->find('span.team_name_span a',0)->innertext, ...];
        $teams[] = $team;
    }
}
foreach($teams as $team) {
    echo $team->name;
    echo $team->matches;
}
// END DOM Parsing Block
?>
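Solution: http://phpfiddle.org/main/code/cq54-hta2
Class names don't contain spaces, so don't try to match them that way
SimpleHtmlDom doesn't support attribute selectors like that. You are also matching the class as if a single class name could contain spaces, whereas "wins col_wins" is two class names separated by a space. So, instead of this: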
$wins = $html->find("td[class=wins col_wins]");
$draws = $html->find("td[class=draws col_draws]");
$losses = $html->find("td[class=losses col_losses]");
do the following to match td elements that carry both class names:
$wins = $html->find("td.wins.col_wins");
$draws = $html->find("td.draws.col_draws");
$losses = $html->find("td.losses.col_losses");
In addition, the HTML markup doesn't require you to match both classes to get the data; you can simply do this:
$wins = $html->find("td.col_wins");
$draws = $html->find("td.col_draws");
$losses = $html->find("td.col_losses");
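To illustrate why matching a single class name is enough, here is a minimal sketch in plain PHP, independent of Simple HTML DOM. The helpers classTokens and hasClasses are hypothetical names introduced only for this example; they treat the class attribute as a whitespace-separated list of names, which is roughly what a selector like td.col_wins relies on.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): split a class attribute
// value into its individual class names.
function classTokens(string $classAttr): array
{
    $trimmed = trim($classAttr);
    return $trimmed === '' ? [] : preg_split('/\s+/', $trimmed);
}

// Hypothetical helper: true when every wanted class name is one of the tokens.
function hasClasses(string $classAttr, array $wanted): bool
{
    return array_diff($wanted, classTokens($classAttr)) === [];
}

$attr = 'wins col_wins'; // class attribute of the "wins" cell in the question

var_dump(hasClasses($attr, ['col_wins']));         // true: one name is enough
var_dump(hasClasses($attr, ['wins', 'col_wins'])); // true: both names present
var_dump(hasClasses($attr, ['wins col_wins']));    // false: no single name contains a space
```

The last call mirrors what the original attribute selector was effectively asking for: a single class name with a space in it, which cannot exist.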
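Getting repeated selectors (iterating over rows)
What you want to extract is an array of data from the table rows. More specifically, something that looks like this: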
$teams = [
['Arsenal', matches, wins, ...],
['Liverpool', matches, wins, ...],
...
];
This means you need to run the same data extraction on every row of the table. SimpleHtmlDom makes this easy with its jQuery-like find method, which can be called from any matched element.
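The row-first idea can be sketched without the library. The example below swaps in PHP's built-in DOMDocument and DOMXPath in place of Simple HTML DOM so that it runs on its own; the helpers hasClass and firstText are hypothetical names, and the markup is a trimmed copy of the table from the question. With Simple HTML DOM the inner lookups would be $row->find('td.col_wins', 0) and so on.

```php
<?php
// Trimmed copy of the table from the question (two rows, fewer columns).
$markup = <<<HTML
<table class="stats-table stats-main table-2">
<tr class="odd"><td class="participant_name col_name"><span class="team_name_span"><a>Manchester United</a></span></td><td class="matches_played col_matches_played">4</td><td class="wins col_wins">4</td></tr>
<tr class="even"><td class="participant_name col_name"><span class="team_name_span"><a>Arsenal</a></span></td><td class="matches_played col_matches_played">4</td><td class="wins col_wins">4</td></tr>
</table>
HTML;

// Hypothetical helper: XPath test for "the class attribute contains this
// class name as a whole token".
function hasClass(string $name): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// Hypothetical helper: text of the first node matching $query inside $row,
// or null when nothing matches.
function firstText(DOMXPath $xpath, DOMNode $row, string $query): ?string
{
    $nodes = $xpath->query($query, $row);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

$teams = [];
// Loop over the rows first, then query each cell relative to its own row.
foreach ($xpath->query('//table[' . hasClass('stats-table') . ']//tr') as $row) {
    $name = firstText($xpath, $row, './/span[' . hasClass('team_name_span') . ']/a');
    if ($name === null) {
        continue; // not a team row
    }
    $teams[] = [
        'name'    => $name,
        'matches' => firstText($xpath, $row, './/td[' . hasClass('col_matches_played') . ']'),
        'wins'    => firstText($xpath, $row, './/td[' . hasClass('col_wins') . ']'),
    ];
}

print_r($teams);
```

Because every cell is looked up relative to its own row, each team's numbers stay together instead of ending up in separate page-wide lists.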
Full solution
This solution actually defines a Team object to load each row's data into. It should make future tweaks simpler.
The important thing to note here is that we first loop over each table row as $row, and only then collect the team and its numbers from $row->find([selector]).
// START team.php
class Team
{
    public $name, $matches, $wins, $draws, $losses, $goals;

    public function __construct($row)
    {
        $this->name = $this->parseMatch($row, 'span.team_name_span a');
        if (null === $this->name) {
            return; // couldn't even match the name, probably not a team row, skip it
        }
        $this->matches = $this->parseMatch($row, 'td.col_matches_played');
        $this->wins = $this->parseMatch($row, 'td.col_wins');
        $this->draws = $this->parseMatch($row, 'td.col_draws');
        $this->losses = $this->parseMatch($row, 'td.col_losses');
        $this->goals = $this->parseMatch($row, 'td.col_goals');
    }

    private function parseMatch($row, $selector)
    {
        if (!empty($match = $row->find($selector, 0))) {
            return $match->innertext;
        }
        return null;
    }

    public function isValid()
    {
        return null !== $this->name;
    }

    public function getMatchData() //example
    {
        return "<br><b>". $this->wins .' : '. $this->matches . "</b>";
    }
}
// END team.php
// START DOM parsing block
$teams = [];
foreach($html->find('table.stats-table tr') as $row) {
    $team = new Team($row); // load the row into a Team object if possible
    // skip this entry if it couldn't match the row
    if ($team->isValid()) {
        // what we're actually doing here is just the OOP equivalent of:
        // $teams[] = ['name' => $row->find('span.team_name_span a',0)->innertext, ...];
        $teams[] = $team;
    }
}
foreach($teams as $team) {
    echo "<h1>".$team->name."</h1>";
    echo $team->losses;
    echo $team->getMatchData();
}
// END DOM Parsing Block
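To get back to the table layout from the question, the collected teams could be rendered one row each. This is only a sketch: renderTeamRows is a hypothetical helper, and the sample object is stand-in data carrying the same public properties as the Team class, filled with the first row of the question's table.

```php
<?php
// Hypothetical renderer: builds one <tr> per team. Expects objects exposing
// the same public properties as the Team class above.
function renderTeamRows(array $teams): string
{
    $html = '';
    foreach ($teams as $team) {
        $cells = [$team->name, $team->matches, $team->wins, $team->draws, $team->losses, $team->goals];
        $html .= '<tr>';
        foreach ($cells as $cell) {
            // Escape each value; this assumes the values are plain text.
            $html .= '<td>' . htmlspecialchars((string) $cell) . '</td>';
        }
        $html .= "</tr>\n";
    }
    return $html;
}

// Stand-in data (assumption): values copied from the first row in the question.
$sample = (object) [
    'name' => 'Manchester United', 'matches' => '4', 'wins' => '4',
    'draws' => '0', 'losses' => '0', 'goals' => '14:0',
];

echo renderTeamRows([$sample]);
```

In the real script the argument would be the $teams array built in the parsing block, echoed between the table's header and its closing tag.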