PHP - Get page content after all dynamic content has been loaded
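I am trying to get the source code of this page: https://www.assetstore.unity3d.com/en/
I want to parse the "Top Paid" box on the right side for a small project, but when I use file_get_contents or the code below, I don't get the correct source code.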
<?php
// CURLOPT_COOKIEFILE / CURLOPT_COOKIEJAR expect a file path, not the
// resource that tmpfile() returns, so create a named temp file instead.
$cookie = tempnam(sys_get_temp_dir(), 'cookie');
$userAgent = 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31';
$ch = curl_init('https://www.assetstore.unity3d.com/en/');
$options = array(
    CURLOPT_CONNECTTIMEOUT => 20,
    CURLOPT_USERAGENT      => $userAgent,
    CURLOPT_AUTOREFERER    => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => $cookie,
    CURLOPT_COOKIEJAR      => $cookie,
    // Certificate checks are switched off here; acceptable only for a quick test.
    CURLOPT_SSL_VERIFYPEER => 0,
    CURLOPT_SSL_VERIFYHOST => 0,
    CURLOPT_TIMEOUT        => 10
);
curl_setopt_array($ch, $options);
$kl = curl_exec($ch);
curl_close($ch);
echo $kl;
?>
Returns:
<div id="assetstore">
    <section id="content-panels">
        <div id="adminarea"></div>
        <div id="downloadarea" class="outer-content">
            <div class="flex">
                <div id="packagelistUI"></div>
                <div id="packagelist"></div>
            </div>
        </div>
        <div id="contentarea">
            <div id="content" class="main">
                <section id="mainContent"></section>
            </div>
        </div>
    </section>
</div>
But the "Top Paid" box is inside the "mainContent" section, which comes back empty. How can I get its markup?
SOLVED
Thanks to Pramod, this is my code now:
<?php
// An example of using php-webdriver.
require_once('lib/__init__.php');
// start Firefox with a 5 second connection timeout
$host = 'http://localhost:4444/wd/hub'; // this is the default Selenium server address
$capabilities = DesiredCapabilities::firefox();
$driver = RemoteWebDriver::create($host, $capabilities, 5000);
// navigate to the Unity Asset Store page
$driver->get('https://www.assetstore.unity3d.com/en/');
// reset cookies and add one (name and value are placeholders)
$driver->manage()->deleteAllCookies();
$driver->manage()->addCookie(array(
    'name'  => 'cookie_name',
    'value' => 'cookie_value',
));
$cookies = $driver->manage()->getCookies();
// wait at most 10 seconds until at least one "top-list" element is present
$driver->wait(10)->until(
    WebDriverExpectedCondition::presenceOfAllElementsLocatedBy(
        WebDriverBy::className('top-list')
    )
);
// page source after JavaScript has populated the page
$sString = $driver->getPageSource();
// close Firefox
$driver->quit();
print_r($sString);
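Once the rendered source is in $sString, the box still has to be pulled out of it. A minimal sketch of one way to do that with PHP's bundled DOM extension follows. The helper name extractByClass and the markup in $sample are made up for illustration; the real structure inside "top-list" would need to be checked in the actual page source.

```php
<?php
// Hypothetical helper: returns the whitespace-normalized text of every
// element whose class attribute contains $class as a whole token.
// $class is assumed to be a plain class name without quote characters.
function extractByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    // Real pages are rarely clean markup; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Pad with spaces so "top-list" does not also match "top-list-item".
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = array();
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim(preg_replace('/\s+/', ' ', $node->textContent));
    }
    return $texts;
}

// Stand-in for $sString; the contents of "top-list" are invented.
$sample = '<section id="mainContent"><div class="top-list">'
        . '<a>Asset A</a> <a>Asset B</a></div></section>';
print_r(extractByClass($sample, 'top-list'));
```

With the Selenium script above, the call would be extractByClass($sString, 'top-list').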
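I think the page you are trying to fetch uses JavaScript to load its content. file_get_contents (like cURL) only downloads the initial HTML and never executes JavaScript, so the dynamically loaded content is missing from the result.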
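A quick way to confirm this for a given page is to check whether the container exists in the static HTML but has nothing inside it. The sketch below does that with PHP's DOM extension; the helper name isEmptyPlaceholder is made up for illustration, and $static is cut down from the output shown in the question.

```php
<?php
// Hypothetical helper: true when the element with the given id exists
// but contains no child elements and no non-whitespace text.
function isEmptyPlaceholder(string $html, string $id): bool
{
    $doc = new DOMDocument();
    // Suppress warnings about HTML5 tags that libxml does not know.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query(sprintf('//*[@id="%s"]', $id));
    if ($nodes->length === 0) {
        return false; // the element is not in the document at all
    }

    $node = $nodes->item(0);
    return $xpath->query('*', $node)->length === 0
        && trim($node->textContent) === '';
}

// Trimmed from the cURL output in the question.
$static = '<div id="content" class="main">'
        . '<section id="mainContent"></section></div>';
var_dump(isEmptyPlaceholder($static, 'mainContent'));
```

If this reports true for HTML fetched with cURL while the browser shows content in the same element, the content is almost certainly being filled in by a script.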
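We can use Selenium from PHP to read such pages, because it drives a real browser that runs the page's scripts.
https://github.com/facebook/php-webdriver
See the link above.
Thanks,
Pramod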