How to extract all URLs (links) on a web page as plain text?
So basically I want to extract all URLs from a web page, even if they are not clickable links.
For example, the page source might be:
<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body>
</html>
I would like both URLs to be displayed:
http://clicklink.com and http://foobar.com
I also don't want the surrounding HTML markup included.
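My current script grabs the URLs, but it also seems to grab a lot of other junk, which leaves the links clickable and makes them impossible to store in the database.
Here is my current code.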
<?php
$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false,
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
$url="http://www.frozencpu.com/";
$data=file_get_contents($url);
$data = strip_tags($data,"<a>");
$d = preg_split("/<\/a>/",$data);
foreach ( $d as $k=>$u ){
if( strpos($u, "<a href=") !== FALSE ){
//echo $u;
//echo "<BR>";
$u = preg_replace("/.*<a\s+href=\"/sm","",$u);
$u = preg_replace("/\".*/","",$u);
//echo $u;
//echo "<BR>";
$db->exec("INSERT INTO urls(url, crawled) VALUES('$u', '0')");
}
}
?>
Here is a sample of the output:
http://www.facebook.com/pages/FrozenCPUcom/351841771499<BR>http://twitter.com/FrozenCPU<BR>/rss/frozencpu.rss<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?id=CR9RnD2g<BR>
*It seems fine up to this point
Then it just junks up big time
<a href='http://www.frozencpu.com/advanced_search.html?id=CR9RnD2g' class=small>Advanced Search<BR>http://www.frozencpu.com/brands/shop_by_brand.html?id=CR9RnD2g<BR>http://www.frozencpu.com/shop_category.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g30/Liquid_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g57/EK_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g59/XSPC_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g60/LutroO_Products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g12/Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g40/Air_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g53/Apparel.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g34/Bay_Devices.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g54/Cabinet_Cooling.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g2/Cables.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g32/Caffeine.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g1/Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g58/CaseLabs_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g45/Custom_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g43/Case_Parts-OEM.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g51/Connectors.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g48/CPU_Heatsinks.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g44/DIYMod_Parts.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g4/Electronics.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g36/Fans.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g47/Fan_Accessories.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g39/Gaming.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g6/Lighting.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g49/Phase_Change.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g11/Power_Supplies.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g55/Screws.html?id=CR9RnD2g<BR>http://www.frozencpu
.com/cat/l1/g35/SleevingHeatshrink.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g7/Sound_Dampening.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g52/Switches.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g8/Thermal_Interface.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g31/Travel_Cases.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g33/Ultra_Quiet.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g42/Window_Kits.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cat/l1/g50/Custom_Services.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?enable=1&id=CR9RnD2g<BR>http://www.frozencpu.com/products/2770/gc-01/Gift_Certificate.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/aboutus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/resource.html?id=CR9RnD2g<BR>http://www.frozencpu.com/career.html?id=CR9RnD2g<BR>http://www.frozencpu.com/clearance/list/p1/Clearance-Page1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>http://www.frozencpu.com/links.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>http://www.frozencpu.com/media.html?id=CR9RnD2g<BR>http://www.frozencpu.com/account.html?id=CR9RnD2g<BR>http://www.frozencpu.com/manage_carts.html?view_cart=Wish%2dList&wish_list=1&id=CR9RnD2g<BR>http://www.frozencpu.com/new_products.html?id=CR9RnD2g<BR>http://www.frozencpu.com/powder_coating.html?id=CR9RnD2g<BR>http://www.frozencpu.com/press.html?id=CR9RnD2g<BR>http://www.frozencpu.com/rebates.html?id=CR9RnD2g<BR>http://www.frozencpu.com/cart.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/tracking.html?id=CR9RnD2g<BR>http://www.frozencpu.com/stores.html?id=CR9RnD2g<BR>
<a href='http://www.facebook.com/pages/FrozenCPUcom/351841771499' target=<BR>
<a href='http://twitter.com/FrozenCPU' target=<BR>
<a href='/rss/frozencpu.rss' target=<BR>https://www.resellerratings.com
<BR>https://www.securitymetrics.com/sitecertsummary.adp?s=67%2e228%2e74%2e232&i=340380<BR>mailto:lori@frozencpu.com?subject=WESTERN%20UNION<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23382/ex-wat-303/XSPC_Raystorm_RX240_V3_Extreme_Universal_CPU_Water_Cooling_Kit_w_D5_Variant_Pump_Included_and_Free_Dead-Water.html?id=CR9RnD2g
The XSPC Raystorm RX240 V3 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well.
The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is a top o...
3 In Stock, Ships Today Till 6pm EST
9.99
<BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/17220/ex-wat-223/XSPC_Copper_Raystorm_AX240_Extreme_Intel_CPU_Water_Cooling_Kit_w_Twin_D5_w_Free_Dead-Water.html?id=CR9RnD2g
The RayStorm Copper Twin D5 AX240 kit is the most powerful 240 kit XSPC have ever made. It includes a special Copper edition of our RayStorm block, our fantastic new AX240 radiator and two D5 Vario pumps in series.
The RayStorm Copper has the same great performance as our award winning RayStorm block, but with an all metal design. The acetal top...
7 In Stock, Ships Today Till 6pm EST
9.99
<BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/22914/cas-495/PrimoChill_Hasher_-_Rugged_Crypto_Stackable_Mining_Rack_R-HRC.html?id=CR9RnD2g
PrimoChill once again provides a good lookin, easy solution to the unimaginable. Introducing, one hell of a crypto rack, The Hasher!
Built out of rugged, 1in anodized extruded aluminum t-slot, the PrimoChill Hasher is tough but cool enough to keep out of the basement. It combines not only functionality but order to the chaos that other mining r...
5 In Stock, Ships Today Till 6pm EST
9.99
<BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/13815/ele-933/Add2PSU_Multiple_Power_Supply_Adapter.html?id=CR9RnD2g
Small, lightweight, and true Plug N Play, the Add2Psu adapter allows you to add more power to your computer. No cutting wires or soldering, no compromising the integrity or function of your PC.
Now there is a way to add more power to your PC. Finally a true plug and play way to manage additional power for those big video cards, bigger hard drive...
290 In Stock, Ships Today Till 6pm EST
.95
<BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25635/ex-wat-335/Larkooler_SkyWater_330L_All-In-One_Liquid_Cooling_Kit_LCS0030.html?id=CR9RnD2g
The SkyWater 330L is a new liquld cooling system with a variable speed pump and Fans in desktop PC. The water cooling system is designed for the best thermal solution of CPU, the most important component of your PC. The SkyWater 330L provides a low noise at low speed fans , high performance at high speed fans and reliable liquid cooling system.
...
4 In Stock, Ships Today Till 6pm EST
9.99
<BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26337/ex-blc-1942/Aquacomputer_Kryographics_GTX_980_Full_Coverage_Liquid_Cooling_Block_-_Copper_Acrylic_Glass_23614.html?id=CR9RnD2g
Combined GPU/RAM/VRM-cooler for graphics cards of the type nvidia GTX 980 with 4 GB RAM according to reference design.
This cooler combines the features of a graphics chip cooler and RAM-coolers in an elegant and very flat watercooler. Additionally the voltage regulators are also cooled effectively.
The kryographics for GTX 980 water block offe...
5 In Stock, Ships Today Till 6pm EST
9.99
<BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/19760/bus-348/Lamptron_CW611_36W_-_6_Channel_Aluminum_Liquid_Cooling_Controller_-_Black_CW611.html?id=CR9RnD2g
Introducing the Lamptron CW611 Water Cooling fan controller! The first in a series of advanced control 5.25″ bay devices that allow complete control over your entire PC cooling system. You can use this controller to be used with fans, liquid cooling pumps, as well as flow meters. The first in a new series of controllers this is sure to get ...
52 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/9350/fan-583/Noiseblocker_NB-BlackSilentFan_XM2_40mmx10mm_Ultra_Quiet_Fan_-_3800_RPM_-_14_dBA.html?id=CR9RnD2g
The Noiseblocker NB-BlackSilentFan XM2 40mmx10mm Ultra Quiet Fan, manufactured by Noiseblocker, Germany's quietest fan manufacturer, the BlackSilentFan series features extraordinary life spans and near silent operation. Using the NB-Longlife advanced sleeve bearing and matched with the NB-EKA drive, the BlackSilentFan series runs more than double ...
20 In Stock, Ships Today Till 6pm EST
.95
<BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25250/cst-1779/Phanteks_Enthoo_Luxe_Full_Tower_Chassis_w_Window_-_White_PH-ES614L_WT.html?id=CR9RnD2g
Staying true to the Phanteks’ Enthoo line, the Luxe features a sandblasted front and top panel. Ambient lighting run from top to front of the case on both sides. Even though smaller in size, the Enthoo Luxe boost many features from the award-winning Enthoo Primo. The Luxe comes pre-installed with a 200mm front fan and 2x PH-F140SP fans. Phanteks’ E...
In Stock, Ships Today Till 6pm EST
9.99
<BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25721/ex-wat-337/MagiCool_DIY_Complete_Single_120mm_Liquid_Cooling_Kit_MC-G12V1.html?id=CR9RnD2g
The MagiCool DIY Complete Liquid Cooling Kit comes with everything you need to set your system up on liquid. The CPU block is compatible with all current sockets giving you flexibility for now and for future upgrades as well. The radiator is a slim profile variant allowing for maximum case compatibility.
Compression fittings are provided for dur...
5 In Stock, Ships Today Till 6pm EST
4.99
<BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26065/ex-blc-1936/Alphacool_NexXxoS_GPX_Nvidia_Geforce_GTX_970_M03_Liquid_Cooling_Blockw_Backplate_11199.html?id=CR9RnD2g
With the new NexXxoS GPX coolers Alphacool is again a step ahead! Optimum performance and quality in a new cooling design for a great price!
A new sophisticated injection system means the GPU is actively cooled. All other chips are sufficiently cooled by the passive cooler which is also in contact with the watercooling block for extra efficiency...
3 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/14175/bus-285/Alphacool_Heatmaster_II_Liquid_Cooling_PCB_Control_Board_26153.html?id=CR9RnD2g
The new generation of cooling control from Alphacool: The Heatmaster II
The new Alphacool Heatmaster II was developed in Germany over multiple years, and has continuously been improved considering the experiences from the first version. Hence we are now, after a development and testing period of almost 3 years, able to present the best Heatmaste...
4 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g
EK ZMT (Zero Maintainance Tubing) is a high quality, zero maintainance industrial grade EPDM rubber tubing in stylish matte black.
This tubing is - just like Norprene - designed to withstand harsh conditions for a very long period of time, offering a truly exceptional lifespan even under UV, ozone and heat exposure for many years.
Unlike most...
62 In Stock, Ships Today Till 6pm EST
.50
<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g
The XSPC Raystorm DDC Photon EX360 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well.
The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is...
5 In Stock, Ships Today Till 6pm EST
4.99
<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g
A new generation of fans joins the Alphacool range. The Susurro, Spanish for Whisper.
A fundamental review of known fan designs was used to manufacture the Susurro. The perfect harmony between the AlphaCool blue and deep blacks make a great impression. The transparent black fan is optimized to cause virtually no noise.
But don’t be persuaded ...
2 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g
The best Alphacool reservoir mounts of all times!
Many reservoir mounts were designed for the original tube reservoirs from the beginning of the PC water cooling sector. During the last years though, the reservoirs became larger, sized for more capacity and metal was integrated for the end caps. This resulted in heavier reservoirs, making the co...
1 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?gu=1&id=CR9RnD2g<BR>http://www.frozencpu.com/help/h25/Ordering_with_a_PO.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/problem.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h15/Legal.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h13.html?id=CR9RnD2g<BR>http://www.getfirefox.com<BR>
To match all kinds of URLs, the following code may help you:
<?php
$content = '<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body>
</html>';
// Note: \$ stops PHP from interpolating $_ as a variable inside double-quoted strings.
$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor
$matches = array(); //create array
$pattern = "/$regex/";
preg_match_all($pattern, $content, $matches);
print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));
/*
* With your code
*/
$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false,
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
$url="http://www.frozencpu.com/";
$data=file_get_contents($url);
$matches = array();
preg_match_all($pattern, $data, $matches);
$array = array_values(array_unique($matches[0]));
// Prepare once and bind each URL, so quotes inside a URL cannot break the SQL.
$stmt = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')");
foreach ($array as $u) {
    $stmt->execute(array($u));
}
?>
Here is the updated code. It seems to work, but it is extremely slow.
<?php
$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false,
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
$url="http://proxylists.connectionincognito.com/";
$content=file_get_contents($url);
// Note: \$ stops PHP from interpolating $_ as a variable inside double-quoted strings.
$regex = "((https?|ftp)\:\/\/)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=\$_.-]+(\:[a-z0-9+!*(),;?&=\$_.-]+)?@)?"; // User and Pass
$regex .= "([a-z0-9-.]*)\.([a-z]{2,4})"; // Host or IP
$regex .= "(\:[0-9]{2,5})?"; // Port
$regex .= "(\/([a-z0-9+\$_-]\.?)+)*\/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:@&%=+\/\$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+\$_.-]*)?"; // Anchor
$matches = array(); //create array
$pattern = "/$regex/";
preg_match_all($pattern, $content, $matches);
$unique = array_unique($matches[0]);
foreach ($unique as $url) {
    //Insert if none exist
    // Bind the URL itself to a placeholder (the original bound $_GET['id'] to a placeholder that did not exist).
    $stmt = $db->prepare("SELECT * FROM urls WHERE url = ?");
    $stmt->bindValue(1, $url, PDO::PARAM_STR);
    $stmt->execute();
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if( ! $row)
    {
        $ins = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')");
        $ins->execute(array($url));
    }
    //Insert end code
}
?>
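One possible contributor to the slowness is that the loop above runs a SELECT and an INSERT for every URL and prepares a statement on each pass; the regex itself may also be expensive on large pages, so this is only part of the picture. The following is a minimal sketch, not a drop-in fix. It assumes a UNIQUE constraint on `urls.url` (not stated in the question) and uses an in-memory SQLite database as a stand-in so it runs on its own. On MySQL the corresponding statement would be `INSERT IGNORE INTO ...`. The `$unique` sample values are hypothetical.

```php
<?php
// Assumption: in-memory SQLite stands in for the MySQL "crawler" database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Assumption: a UNIQUE constraint on url lets the database reject duplicates itself.
$db->exec("CREATE TABLE urls (url TEXT NOT NULL UNIQUE, crawled INTEGER NOT NULL DEFAULT 0)");

// Hypothetical sample data standing in for the URLs found by preg_match_all().
$unique = array('http://clicklink.com', 'http://foobar.com', 'http://clicklink.com');

// Prepare once, outside the loop (MySQL spelling: INSERT IGNORE INTO ...).
$insert = $db->prepare("INSERT OR IGNORE INTO urls(url, crawled) VALUES(?, 0)");

$db->beginTransaction();            // one commit for the whole batch
foreach ($unique as $url) {
    $insert->execute(array($url));  // duplicates are skipped by the UNIQUE constraint
}
$db->commit();

$stored = $db->query("SELECT url FROM urls ORDER BY url")->fetchAll(PDO::FETCH_COLUMN);
print_r($stored);
```

With this shape there is one statement per URL instead of two, and the duplicate check is done by the index rather than by a separate query.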
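Reference:
If you want all URLs, you can't just look inside <a href=, especially since the href attribute of <a> won't always be the first thing inside the tag. A tag like <a target=_blank href=http://google.com> would be ignored.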
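The point about attribute order can be checked with a small sketch. The three tags below are hypothetical variants (double-quoted, single-quoted, and href not first); the `strpos()` test is the one from the question, and the tolerant pattern is an illustrative assumption rather than a complete HTML parser.

```php
<?php
// Hypothetical tag variants: double-quoted, single-quoted, and href not first / unquoted.
$tags = array(
    '<a href="http://clicklink.com">',
    "<a href='http://twitter.com/FrozenCPU' target=_blank>",
    '<a target=_blank href=http://google.com>',
);

$naive = array();   // result of the question's strpos() test
$found = array();   // href value captured by the tolerant pattern

foreach ($tags as $tag) {
    $naive[] = strpos($tag, '<a href=') !== false;

    // Any attributes may precede href; the value may be double-quoted, single-quoted, or bare.
    if (preg_match('/<a\b[^>]*?\bhref\s*=\s*(["\']?)([^"\'\s>]+)\1/i', $tag, $m)) {
        $found[] = $m[2];
    } else {
        $found[] = null;
    }
}

var_dump($naive);
print_r($found);
```

The `strpos()` test misses the third tag entirely. It does accept the second one, but the question's follow-up `preg_replace()` calls only look for a double quote after `href=`, so a single-quoted tag passes through unstripped, which is consistent with the leftover `<a href='...` fragments in the sample output.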
If you want to search for all URLs regardless of context, you can simply ignore the tags and look for the general URL pattern, like this:
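$count = preg_match_all('/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/', $content, $matches);
$urls = $matches[0]; // preg_match_all() returns the number of matches; the URLs themselves are in $matches[0]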
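This probably needs a lot of polishing, but it should get things started. Note, however, that it only matches full URLs. Links to relative pages, such as <a href="index.html">, obviously won't match.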
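As a quick check, here is the pattern above applied to the sample page from the question. The second input string (`http://my-site.example/a_b`) is a hypothetical URL added only to show one of the rough edges.

```php
<?php
$content = '<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body>
</html>';

$pattern = '/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/';

// The matched URLs are collected in $matches[0].
preg_match_all($pattern, $content, $matches);
$urls = array_values(array_unique($matches[0]));
print_r($urls);

// Hypothetical input: the character class has no "-" or "_", so the match stops early.
preg_match_all($pattern, 'see http://my-site.example/a_b', $truncated);
print_r($truncated[0]);
```

On the sample page this yields both `http://clicklink.com` and `http://foobar.com`. The second call returns only `http://my`, which is the kind of polishing the pattern still needs: characters such as `-`, `_`, `&` and `#` would have to be added to the class.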
Since regular expressions are not a recommended solution for parsing HTML, I'm afraid you will have to look for a more suitable solution, such as DOMDocument(), to parse the page and find the URLs properly.
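A minimal sketch of the DOMDocument route, under some assumptions: it reads `href` attributes from `<a>` elements through the DOM, and then uses a deliberately loose, hypothetical pattern for bare URLs, applied to text nodes only rather than to the raw markup. The variable names are illustrative.

```php
<?php
$html = '<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body>
</html>';

libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$urls = array();

// 1) href attributes, wherever they sit in the tag and however they are quoted.
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '') {
        $urls[] = $href;
    }
}

// 2) Bare URLs in text nodes (hypothetical, loose pattern; tune as needed).
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//text()') as $text) {
    if (preg_match_all('~https?://[^\s<>"\']+~i', $text->nodeValue, $m)) {
        foreach ($m[0] as $u) {
            $urls[] = $u;
        }
    }
}

$urls = array_values(array_unique($urls));
print_r($urls);
```

Relative links such as `index.html` come back as-is from `getAttribute('href')`, so they would still need to be resolved against the page URL before being stored.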
4 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/23748/ex-tub-3052/EK_ZMT_Tubing_-_38_ID_58OD_-_1_Foot_-_Black_EK-Tube_ZMT_Matte_Black_15995mm.html?id=CR9RnD2g
EK ZMT (Zero Maintainance Tubing) is a high quality, zero maintainance industrial grade EPDM rubber tubing in stylish matte black.
This tubing is - just like Norprene - designed to withstand harsh conditions for a very long period of time, offering a truly exceptional lifespan even under UV, ozone and heat exposure for many years.
Unlike most...
62 In Stock, Ships Today Till 6pm EST
.50
<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/25897/ex-wat-342/XSPC_Raystorm_EX360_Extreme_Universal_CPU_Water_Cooling_Kit_w_DDC_Photon_and_Free_Dead-Water.html?id=CR9RnD2g
The XSPC Raystorm DDC Photon EX360 Universal CPU Water Cooling Kit comes complete with everything you will need to cool your CPU. This kit is designed to handle your CPU and can be expanded to handle more blocks as well.
The kit uses the newest XSPC CPU block, the Raystorm as the core cooling component. This block has a pure copper base and is...
5 In Stock, Ships Today Till 6pm EST
4.99
<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/26379/fan-1397/Alphacool_Susurro_120mm_x_25mm_Fan_-_1700RPM_24684.html?id=CR9RnD2g
A new generation of fans joins the Alphacool range. The Susurro, Spanish for Whisper.
A fundamental review of known fan designs was used to manufacture the Susurro. The perfect harmony between the AlphaCool blue and deep blacks make a great impression. The transparent black fan is optimized to cause virtually no noise.
But don’t be persuaded ...
2 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g<BR>http://www.frozencpu.com/products/18800/ex-res-486/Alphacool_Clip-On_Reservoir_Mount_2_Piece_Set_w_5mm_LED_Support_-_50mm.html?id=CR9RnD2g
The best Alphacool reservoir mounts of all times!
Many reservoir mounts were designed for the original tube reservoirs from the beginning of the PC water cooling sector. During the last years though, the reservoirs became larger, sized for more capacity and metal was integrated for the end caps. This resulted in heavier reservoirs, making the co...
1 In Stock, Ships Today Till 6pm EST
.99
<BR>http://www.frozencpu.com/news.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?id=CR9RnD2g<BR>https://www.frozencpu.com/login.html?gu=1&id=CR9RnD2g<BR>http://www.frozencpu.com/help/h25/Ordering_with_a_PO.html?id=CR9RnD2g<BR>http://www.frozencpu.com/testimonials.html?id=CR9RnD2g<BR>http://www.frozencpu.com/index.html?id=CR9RnD2g<BR>http://www.frozencpu.com/sitemap.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help_center.html?id=CR9RnD2g<BR>http://www.frozencpu.com/contactus.html?id=CR9RnD2g<BR>http://www.frozencpu.com/problem.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h15/Legal.html?id=CR9RnD2g<BR>http://www.frozencpu.com/help/h13.html?id=CR9RnD2g<BR>http://www.getfirefox.com<BR>
To match all kinds of URLs, the following code may help:
<?php
$content = '<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
</body>
</html>';
// The pattern parts are single-quoted so that "$_" stays literal.
// In double quotes PHP would interpolate it as an (undefined) variable.
$regex = '((https?|ftp)\:\/\/)?'; // Scheme
$regex .= '([a-z0-9+!*(),;?&=$_.-]+(\:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and pass
$regex .= '([a-z0-9-.]*)\.([a-z]{2,4})'; // Host or IP
$regex .= '(\:[0-9]{2,5})?'; // Port
$regex .= '(\/([a-z0-9+$_-]\.?)+)*\/?'; // Path
$regex .= '(\?[a-z+&$_.-][a-z0-9;:@&%=+\/$_.-]*)?'; // GET query
$regex .= '(#[a-z_.-][a-z0-9+$_.-]*)?'; // Anchor
$matches = array(); // filled by preg_match_all()
$pattern = "/$regex/";
preg_match_all($pattern, $content, $matches);
print_r(array_values(array_unique($matches[0])));
echo "<br><br>";
echo implode("<br>", array_values(array_unique($matches[0])));
/*
* With your code
*/
$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false,
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
$url = "http://www.frozencpu.com/";
$data = file_get_contents($url);
$matches = array();
preg_match_all($pattern, $data, $matches);
$array = array_values(array_unique($matches[0]));
// Bind each URL as a parameter instead of interpolating it into the SQL.
$stmt = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')");
foreach ($array as $u) {
$stmt->execute(array($u));
}
?>
Here is the updated code. It seems to work, but it is extremely slow.
<?php
$db = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8', 'crawler', '***', array(PDO::ATTR_EMULATE_PREPARES => false,
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));
$url = "http://proxylists.connectionincognito.com/";
$content = file_get_contents($url);
// Single-quoted so that "$_" stays literal inside the pattern.
$regex = '((https?|ftp)\:\/\/)?'; // Scheme
$regex .= '([a-z0-9+!*(),;?&=$_.-]+(\:[a-z0-9+!*(),;?&=$_.-]+)?@)?'; // User and pass
$regex .= '([a-z0-9-.]*)\.([a-z]{2,4})'; // Host or IP
$regex .= '(\:[0-9]{2,5})?'; // Port
$regex .= '(\/([a-z0-9+$_-]\.?)+)*\/?'; // Path
$regex .= '(\?[a-z+&$_.-][a-z0-9;:@&%=+\/$_.-]*)?'; // GET query
$regex .= '(#[a-z_.-][a-z0-9+$_.-]*)?'; // Anchor
$matches = array(); // filled by preg_match_all()
$pattern = "/$regex/";
preg_match_all($pattern, $content, $matches);
$unique = array_unique($matches[0]);
// Prepare both statements once; the URL is bound as a parameter.
// (The earlier bindParam() call had no placeholder to bind to.)
$select = $db->prepare("SELECT * FROM urls WHERE url = ?");
$insert = $db->prepare("INSERT INTO urls(url, crawled) VALUES(?, '0')");
foreach ($unique as $url) {
// Insert only if the URL is not stored yet
$select->execute(array($url));
$row = $select->fetch(PDO::FETCH_ASSOC);
$select->closeCursor();
if (!$row) {
$insert->execute(array($url));
}
}
?>
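One possible contributor to the slowness is the SELECT-then-INSERT round trip for every URL. Below is a minimal sketch of an alternative, assuming a UNIQUE index exists on urls.url (that index is an assumption, it is not shown in the original schema). The helper name insertNewUrls and its $sql parameter are hypothetical. INSERT IGNORE is MySQL syntax; other databases spell it differently.

```php
<?php
/**
 * Insert URLs and let the database skip the ones it already has.
 * Assumes a UNIQUE index on urls.url (hypothetical, not in the original post).
 */
function insertNewUrls(
    PDO $db,
    array $urls,
    string $sql = "INSERT IGNORE INTO urls(url, crawled) VALUES(?, '0')"
): void {
    // Prepared once, executed once per URL.
    $stmt = $db->prepare($sql);

    // A single transaction avoids one commit per row.
    $db->beginTransaction();
    try {
        foreach (array_unique($urls) as $url) {
            $stmt->execute(array($url));
        }
        $db->commit();
    } catch (Exception $e) {
        $db->rollBack();
        throw $e;
    }
}
```

With the script above this would be called as insertNewUrls($db, $unique). If the regex itself is the bottleneck, this change will not help much, so it is worth timing preg_match_all() and the database loop separately.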
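If you want all URLs, you can't just look inside <a href=, especially since the href attribute of an <a> tag won't always be the first thing inside the tag. A tag like <a target=_blank href=http://google.com> would be ignored.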
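To illustrate the attribute-order problem, here is a small sketch that compares a pattern expecting href right after <a with a more tolerant one. Both patterns and the sample markup are illustrative assumptions, not code from the original script, and the tolerant pattern is still only a regex approximation of HTML parsing.

```php
<?php
// Sample markup: in the first tag, href is unquoted and not the first attribute.
$html = '<a target=_blank href=http://google.com>Google</a> '
      . '<a href="http://clicklink.com">here</a>';

// Strict: href must directly follow "<a" and be double-quoted.
$strict = '/<a\s+href="([^"]*)"/i';

// Tolerant: other attributes may come first; the value may be double-quoted,
// single-quoted, or unquoted. The (?|...) branch-reset group makes every
// alternative capture into group 1.
$tolerant = '/<a\b[^>]*?\bhref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';

preg_match_all($strict, $html, $m);
$strictUrls = $m[1];

preg_match_all($tolerant, $html, $m);
$tolerantUrls = $m[1];

print_r($strictUrls);   // only http://clicklink.com
print_r($tolerantUrls); // http://google.com and http://clicklink.com
```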
If you want to find all URLs regardless of context, you can simply ignore the tags and look for a general URL pattern, like this:
$urls = preg_match_all('/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/', $content, $matches);
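This probably needs a lot of polishing, but it should get you started. Note, however, that it only matches full URLs. Links to relative pages, such as <a href="index.html">, obviously won't match.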
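As a quick check, here is that pattern applied to the sample page from the question. The extra index.html link is added here only to show the relative-link case.

```php
<?php
$content = '<html>
<title>Random Website I am Crawling</title>
<body>
Click <a href="http://clicklink.com">here</a> for foobar
Another site is http://foobar.com
<a href="index.html">home</a>
</body>
</html>';

// preg_match_all() returns the number of matches;
// the matched URLs themselves end up in $matches[0].
$count = preg_match_all('/[a-z]+:\/\/[a-zA-Z0-9?+.=%:\/]+/', $content, $matches);
$urls = array_values(array_unique($matches[0]));

print_r($urls); // http://clicklink.com and http://foobar.com, but not index.html
```

One thing to polish: the character class has no `_`, `-`, or `&`, so a URL such as the Liquid_Cooling.html links in the sample output would be cut off at the first of those characters.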
Since regular expressions are not a recommended solution for parsing HTML, I'm afraid you will have to look for a more appropriate tool, such as DOMDocument(), to parse the page and search for URLs properly.
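A rough sketch of what the DOMDocument route could look like, assuming the DOM extension is available. The function name extractUrls and the plain-text URL pattern are illustrative choices, not part of the original answer. It collects href values from <a> elements and then scans text nodes for bare http(s) URLs.

```php
<?php
function extractUrls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();

    // 1. href attributes of <a> elements, wherever they sit inside the tag.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href !== '') {
            $urls[] = $href;
        }
    }

    // 2. Bare URLs in text nodes, skipping script and style content.
    $xpath = new DOMXPath($doc);
    $query = '//text()[not(ancestor::script) and not(ancestor::style)]';
    foreach ($xpath->query($query) as $node) {
        if (preg_match_all('~\bhttps?://[^\s<>"\']+~i', $node->nodeValue, $m)) {
            foreach ($m[0] as $u) {
                $urls[] = $u;
            }
        }
    }

    return array_values(array_unique($urls));
}
```

Called on the sample page from the question, this should give http://clicklink.com and http://foobar.com. Relative hrefs such as index.html are returned as-is by step 1, so they would need to be filtered out or resolved against the page URL, and the text pattern may keep trailing punctuation.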