Simple HTML DOM cannot get file

Question

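I don't know what the solution is. I simply cannot get the HTML of this Charizard page: even though the link is correct, I get no response at all. Bulbasaur works fine, but I want this lovely Charizard...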

include("simple_html_dom.php");
$html = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)');
$html2 = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pok%C3%A9mon)');
echo $html;
echo $html2;

Is there some kind of protection on this page, or is Charizard just harder to catch? I'd be very grateful if you could help me.

Jonas :)

Answer 1

Since I couldn't find file_get_html() in the PHP documentation, maybe you would prefer to use file_get_contents(url).
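
For what it's worth, file_get_html() comes from simple_html_dom.php rather than from PHP itself, which is why it is not in the PHP manual. A minimal sketch of the file_get_contents() route follows; the helper name fetch_html is hypothetical, not part of any library. It only makes a failed fetch visible instead of silently returning nothing:

```php
<?php
// Hypothetical helper (not part of simple_html_dom): fetch the raw page
// and raise an exception instead of silently returning false.
function fetch_html(string $url): string
{
    // @ suppresses the warning; error_get_last() still records it.
    $body = @file_get_contents($url);
    if ($body === false) {
        $error = error_get_last();
        throw new RuntimeException(
            'Fetch failed: ' . ($error['message'] ?? 'unknown error')
        );
    }
    return $body;
}
```

You could then call fetch_html() with the Charizard URL and pass the resulting string to whatever parser you prefer.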

Answer 2

I would suggest an alternative library, because I don't think you will get this to work with simple_html_dom:

include 'advanced_html_dom.php';
$html = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)');

echo $html->find('h1', 0)->text() . PHP_EOL;
echo $html->find('big a[title*="Pokédex number"]', 0)->text() . PHP_EOL;

This gives:

Charizard (Pokémon)
#006

Answer 3

There are two problems here:

1. The content fetched from this URL is longer than MAX_FILE_SIZE (defined in simple_html_dom.php).
2. The bug pointed out in the comments (https://github.com/sunra/php-simple-html-dom-parser/issues/37). This bug seems to be resolved in the forked repository that is maintained on GitHub, but it still exists in the original version (which no longer seems to be maintained).

To solve the first problem, edit simple_html_dom.php and change define('MAX_FILE_SIZE', 600000); to use a larger number.
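
Before editing the library, you may want to confirm that the size limit really is what bites. This is only a rough sketch: the constant ASSUMED_MAX_FILE_SIZE and the function exceeds_limit are hypothetical names, and the value simply mirrors the define quoted above.

```php
<?php
// Assumption: mirrors define('MAX_FILE_SIZE', 600000) from the library.
const ASSUMED_MAX_FILE_SIZE = 600000;

// Hypothetical check: is the fetched content longer than the limit (in bytes)?
function exceeds_limit(string $contents, int $limit = ASSUMED_MAX_FILE_SIZE): bool
{
    return strlen($contents) > $limit;
}
```

For example, fetch the page with file_get_contents() and run var_dump(strlen($raw), exceeds_limit($raw)); to compare the Charizard and Bulbasaur pages.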

As a workaround for the second problem, pass the correct arguments to file_get_html; by that I mean pass 0 for $offset:

$html = file_get_html('https://bulbapedia.bulbagarden.net/wiki/Charizard_(Pok%C3%A9mon)',
false,
null,
0); // this last one is the offset

var_dump($html);
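
To see why the offset can matter, here is a small local demonstration of how file_get_contents() treats its offset argument on PHP 7.1 or later, where negative offsets count from the end of the stream. Whether the library's default offset is negative in your copy is an assumption you should check in simple_html_dom.php; the temporary file only stands in for the remote page.

```php
<?php
// Write a stand-in "page" to a temporary local file.
$path = tempnam(sys_get_temp_dir(), 'dom');
file_put_contents($path, '<h1>Charizard</h1>');

// PHP >= 7.1: a negative offset seeks from the end, so only the tail is read.
$fromEnd = file_get_contents($path, false, null, -1);

// Offset 0 reads the whole content from the start.
$whole = file_get_contents($path, false, null, 0);

unlink($path);

var_dump($fromEnd); // string(1) ">"
var_dump($whole);   // string(18) "<h1>Charizard</h1>"
```

On a remote URL the stream generally cannot seek, so a non-zero offset is likely to fail rather than return the tail, which would fit with an explicit 0 being the workaround.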

Or you can use the forked version of the library.


php

simple-html-dom