What's making my web crawler slow?
Hi! I'm making a web crawler using PHP. At first, when I was just getting the web content (using the get_links function), it was fast, but after all the other functions were added it became painfully slow. My web crawler is literally crawling. When I check the Network monitor in the Inspector, I get no response for the request at all. What could be the problem? Is internet speed a factor? Why is it taking so long to load? In case it matters, my platform is Ubuntu 15.04 and I'm just using localhost as the server. Here is my code.
<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');

$to_crawl = "http://bestspace.com";
$c = array();   // list of discovered links
$i = 0;

// Fetch a page and append every link found in it to the global $c array.
function get_links($url) {
    global $c;
    $input = @file_get_contents($url);
    $regexp = "<a\s[^>]*href=(\"??)([^\">]*?)\1[^>]*>(.*)<\/a>";
    preg_match_all("/$regexp/siU", $input, $matches);
    $base_url = parse_url($url, PHP_URL_HOST);
    $l = $matches[2];

    foreach ($l as $link) {
        // Drop the fragment (#...) unless the link starts with one.
        if (strpos($link, "#")) {
            $link = substr($link, 0, strpos($link, "#"));
        }
        // Strip a leading "." from relative links such as "./page".
        if (substr($link, 0, 1) == ".") {
            $link = substr($link, 1);
        }

        if (substr($link, 0, 7) == "http://" || substr($link, 0, 8) == "https://") {
            // Absolute URL: keep as-is.
        } else if (substr($link, 0, 2) == "//") {
            // Protocol-relative URL: drop the slashes, the scheme is added below.
            $link = substr($link, 2);
        } else if (substr($link, 0, 1) == "#") {
            // Fragment-only link: points back to the current page.
            $link = $url;
        } else if (substr($link, 0, 7) == "mailto:") {
            // Mark mailto links so they are never crawled.
            $link = "[" . $link . "]";
        } else {
            // Relative path: resolve against the host of the current page.
            if (substr($link, 0, 1) != "/") {
                $link = $base_url . "/" . $link;
            } else {
                $link = $base_url . $link;
            }
        }

        // Anything that still has no scheme (and is not a mailto marker)
        // gets a default http:// prefix.
        if (substr($link, 0, 7) != "http://" && substr($link, 0, 8) != "https://" && substr($link, 0, 1) != "[") {
            $link = "http://" . $link;
        }

        //echo $link."<br/>";
        if (!in_array($link, $c)) {
            array_push($c, $link);
        }
    }
}

// Crawl the start page, then every link found on it (one level deep).
get_links($to_crawl);
//echo "ARRAY <br />";
foreach ($c as $page) {
    get_links($page);
    //echo $page."<br />";
}

// Return a shortened host name for display purposes.
function get_domain($url)
{
    $host = @parse_url($url, PHP_URL_HOST);
    if (!$host)
        $host = $url;
    if (substr($host, 0, 4) == "www.")
        $host = substr($host, 4);
    if (strlen($host) > 50)
        $host = substr($host, 0, 47) . '...';
    return $host;
}

// Fetch a URL and report its Content-Type header.
// Note: this downloads the full response body of every link.
function content_type($url) {
    $info = '';
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $content = curl_exec($ch);
    if (!curl_errno($ch)) {
        $info = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    }
    curl_close($ch);   // close the handle before returning, not after
    return $info;
}

// Render the results as a table.
echo "<table class='table table-striped'>";
echo "<tbody>";
echo "<tr>";
echo "<th>#</th><th>DOMAIN NAME</th><th>CATEGORY</th><th>URL</th>";
echo "</tr>";
foreach ($c as $page) {
    $i++;
    echo "<tr>";
    echo "<td>" . $i . "</td><td>" . get_domain($to_crawl) . "</td><td>" . content_type($page) . "</td><td>" . $page;
    echo "</td>";
    echo "</tr>";
}
echo "</tbody>";
echo "</table>";
?>
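One likely reason for the slowdown is visible in the code above: content_type() downloads the entire body of every discovered page just to read its Content-Type header, and every request runs with no timeout, so a single slow or unresponsive host can stall the whole run. Below is a minimal sketch of a cheaper check using a HEAD request; the function name content_type_fast and the timeout values are illustrative assumptions, not part of the original code.

<?php
// Minimal sketch: get the Content-Type without downloading the body.
// Assumes the same $url values as content_type() above.
function content_type_fast($url) {                    // hypothetical helper name
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD request: headers only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // don't echo anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);      // give up on dead hosts quickly
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // cap the whole request
    curl_exec($ch);
    $type = curl_errno($ch) ? '' : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    return $type;
}
?>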
I'm sending you a link worth investigating thoroughly, because it's like jQuery for the web. It's high-quality software for web scraping; trust me, you'll be far more productive!
Look at the examples and thank me later :)
http://simplehtmldom.sourceforge.net/
hth, k
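For reference, a minimal sketch of how the recommended Simple HTML DOM parser is typically used for the same link extraction, assuming simple_html_dom.php from the project above has been downloaded next to the script:

<?php
// Minimal sketch using the Simple HTML DOM parser recommended above.
// Assumes simple_html_dom.php sits in the same directory as this script.
include 'simple_html_dom.php';

$html = file_get_html('http://bestspace.com');   // fetch and parse the page
if ($html) {
    foreach ($html->find('a') as $a) {           // CSS-style selector, like jQuery
        echo $a->href . "<br/>";                 // href attribute of each anchor
    }
    $html->clear();                              // free the DOM to save memory
}
?>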