Unable to fetch meta tags for a particular URL
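I am using a PHP script to fetch keywords from the meta tags of a particular website. For some URLs it does not work, even though when I check such a URL manually I can see that the keywords are present in the page.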
$url = "https://www.washingtonpost.com/news/education/wp/2018/02/14/school-shooting-reported-at-florida-high-school/?tid=pm_pop";
get_meta_tags($url);
It always gives me this warning:
Warning: get_meta_tags(https://www.washingtonpost.com/politics/stormy-danielss-tale-gains-renewed-momentum-with-trump-lawyers-claim-which-raises-new-questions/2018/02/14/e7ce4a16-119d-11e8-9065-e55346f6de81_story.html?tid=pm_pop): failed to open stream: Redirection limit reached
Any ideas?
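Here you go:
First: there is an infinite redirect loop. The server only gives you the page when cookies are enabled; without them it keeps redirecting, which is why get_meta_tags hits the redirection limit.
So we will use the curl functions to get the HTML page, in two steps:
- get the cookies
- resend the cookies and get the page
Second: parse the HTML with preg_match_all to get the meta tags.
Finally, the code becomes: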
<?php
$html = get_contents('https://www.washingtonpost.com/news/education/wp/2018/02/14/school-shooting-reported-at-florida-high-school/?tid=pm_pop');

// Parsing begins here: group 2 is the name/property value, group 3 is the content value.
preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $html, $match);
$count = count($match[2]);
for ($i = 0; $i < $count; $i++) {
    echo $match[2][$i] . " : " . $match[3][$i] . "\n";
}

function get_contents($link) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    // The next two options switch off TLS certificate checks, which is insecure;
    // remove them if your CA bundle is configured.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    // <-- see here: setting a cookie jar enables cookie handling on this handle
    // ("-" makes curl write the jar to stdout when the handle is closed).
    curl_setopt($ch, CURLOPT_COOKIEJAR, "-");
    // First request: do not follow redirects, just collect the cookies.
    curl_exec($ch);
    // The handle is not closed yet, so the cookies are kept.
    // Second request on the same handle, this time following redirects:
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $result = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $curlerr = curl_error($ch);
    // Now we are done and can close the handle.
    curl_close($ch);
    // curl_exec() returns false on failure; report the HTTP code and curl error instead.
    if ($result === false || strlen($result) < 5) {
        $result = $result . "Error :" . $httpcode . $curlerr;
    }
    return $result;
}
?>
Note: this HTML could not be parsed with DOMDocument.
Output:
object-hash : 1518960831
referrer : unsafe-url
keywords : Florida school shooting, Marjory Stoneman Douglas High School, Parkland school shooting, Florida shooting, Broward County
news_keywords : Florida school shooting, Marjory Stoneman Douglas High School, Parkland school shooting, Florida shooting, Broward County
twitter:card : summary_large_image
og:type : article
og:site_name : Washington Post
magnet : floridashooting
article:publisher : https://www.facebook.com/washingtonpost
fb:app_id : 41245586762
fb:admins : 4403963
fb:admins : 500835072
article:content_tier : metered
og:url : https://www.washingtonpost.com/news/education/wp/2018/02/14/school-shooting-reported-at-florida-high-school/
og:title : ‘A horrific, horrific day’: At least 17 killed in Florida school shooting
og:description : The suspect, a student who had been expelled, was armed with an AR-15, authorities said.
robots : index,follow
theme : normal
audio_url :
twitter:creator : @lori_rozsa
article:author : https://www.facebook.com/moriah.balingit
author : https://www.facebook.com/moriah.balingit
twitter:creator : @ByMoriah
twitter:creator : @thewanreport
article:author : https://www.facebook.com/markberman
author : https://www.facebook.com/markberman
twitter:creator : @markberman
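The parsing step can be tried on its own, without any network request. The following is a minimal sketch, not the answer's exact code: the helper name extract_meta_tags and the sample HTML string are made up for illustration, and the pattern is the same one used in the answer, written with \s shorthand.

```php
<?php
// Hypothetical helper (not part of the original answer): returns a list of
// [name, content] pairs so that repeated names such as fb:admins are kept.
function extract_meta_tags(string $html): array
{
    // Group 1: attribute kind (name or property); group 2: its value; group 3: content value.
    $pattern = '/<\s*meta\s*(name|property)="?([^>"]*)"?\s*content="?([^>"]*)"?\s*\/?\s*>/si';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        $pairs[] = [$m[2], $m[3]];
    }
    return $pairs;
}

// Made-up sample input standing in for the downloaded page.
$sample = '<html><head>'
    . '<meta name="keywords" content="Florida shooting, Broward County">'
    . '<meta property="og:type" content="article" />'
    . '<meta charset="utf-8">'
    . '</head></html>';

foreach (extract_meta_tags($sample) as [$name, $content]) {
    echo $name . ' : ' . $content . PHP_EOL;
}
```

The charset tag is skipped because it has neither a name nor a property attribute. Note that the pattern only matches tags where name or property comes directly before content, so pages that write the attributes in a different order would need a different approach.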