Extracting the title and abstract from a webpage
I am trying to extract the title and abstract from an arXiv page, for example http://arxiv.org/abs/1207.0102. My code currently looks like this:
function get_title($url){
    $str = file_get_contents($url);
    if(strlen($str)>0){
        $str = trim(preg_replace('/\s+/', ' ', $str)); // supports line breaks inside <title>
        preg_match("/\<title\>(.*)\<\/title\>/i",$str,$title); // ignore case
        return $title[1];
    }
}
echo get_title("http://arxiv.org/abs/1207.0102");
When I run this code, I get the following error:
Warning: file_get_contents(http://arxiv.org/abs/1207.0102): failed to
open stream: HTTP request failed! HTTP/1.1 403 Forbidden in
C:\wamp\www\mysite\Index.php
The problem does not occur with other URLs, for example http://www.washingtontimes.com/.
Does anyone know why this happens?
Also, is it possible to extract the abstract from that page?
The site does not allow requests with an empty User-Agent. Its response:
HTTP/1.1 403 Forbidden
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>403 Forbidden</title></head>
<body>
<h1>Access Denied</h1>
<p>Sadly, your client does not supply a proper User-Agent,
and is consequently excluded.</p>
<p>We have an inordinate number of problems with automated scripts
which do not supply a User-Agent, and violate the automated access
guidelines posted at arxiv.org
-- hence we now exclude them all.</p>
<p>(In rare cases, we have found that accesses through proxy servers
strip the User-Agent information. If this is the case, you need to contact
the administrator of your proxy server to get it fixed.)</p>
<p>If you believe this determination to be in error, see
<b>http://arxiv.org/denied.html</b> for additional information.</p>
</body>
</html>
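To see a response body like the one above from PHP itself, rather than only the warning, one option is the `ignore_errors` HTTP context option, which makes `file_get_contents` return the body even for error statuses. A minimal sketch; the helper names `status_code_from_headers` and `fetch_with_status` are made up for illustration, and it assumes the `user_agent` ini setting is left at its empty default so that no User-Agent is sent:

```php
<?php
// Hypothetical helper: pull the numeric status code out of raw response
// header lines such as "HTTP/1.1 403 Forbidden".
function status_code_from_headers(array $headers)
{
    $code = null;
    foreach ($headers as $line) {
        // After redirects there can be several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: fetch a URL and keep the body on error statuses,
// so the server's own explanation can be read.
function fetch_with_status($url)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'        => 'GET',
            'ignore_errors' => true, // return the body even for 403, 404, ...
        ),
    ));
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the calling scope.
    $headers = isset($http_response_header) ? $http_response_header : array();
    return array(status_code_from_headers($headers), $body);
}

// Usage (needs network access):
// list($status, $body) = fetch_with_status('http://arxiv.org/abs/1207.0102');
// echo $status, "\n", $body;
```

Keeping only the last status line means a redirect chain reports the status of the final response, not of the first hop.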
If you send a User-Agent with the request, for example "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko", it works:
$options = array(
    'http'=>array(
        'method'=>"GET",
        'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
    )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);
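The question also asks about the abstract, which the snippet above does not cover. Here is a minimal sketch that combines the User-Agent context with a DOM parse in place of the regex. It assumes the abstract sits in a `<blockquote>` whose class list includes `abstract` and starts with an "Abstract:" label. That is an assumption about arXiv's markup, so check the page source before relying on it. The function names are made up for illustration.

```php
<?php
// Fetch a page while sending an explicit User-Agent header.
// Returns the HTML string, or null if the request failed.
function fetch_html($url, $userAgent)
{
    $context = stream_context_create(array(
        'http' => array(
            'method' => 'GET',
            'header' => "User-Agent: $userAgent\r\n",
        ),
    ));
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Collapse runs of whitespace (including line breaks) into single spaces.
function squash_whitespace($text)
{
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Returns array('title' => ..., 'abstract' => ...); a value is null when not found.
function extract_title_and_abstract($html)
{
    $result = array('title' => null, 'abstract' => null);
    if ($html === null || trim($html) === '') {
        return $result; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    $titleNode = $xpath->query('//title')->item(0);
    if ($titleNode) {
        $result['title'] = squash_whitespace($titleNode->textContent);
    }

    // ASSUMPTION: the abstract is a <blockquote> with "abstract" among its classes.
    $abstractNode = $xpath->query(
        '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]'
    )->item(0);
    if ($abstractNode) {
        $text = squash_whitespace($abstractNode->textContent);
        // ASSUMPTION: the block starts with an "Abstract:" label; strip it if present.
        $result['abstract'] = preg_replace('/^Abstract:\s*/i', '', $text);
    }

    return $result;
}

// Usage (needs network access):
// $html = fetch_html('http://arxiv.org/abs/1207.0102',
//     'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko');
// $info = extract_title_and_abstract($html);
// echo $info['title'], "\n", $info['abstract'], "\n";
```

A DOM query tends to be more robust here than a regex on the raw HTML, because it does not depend on how the tags are spaced or on what attributes they carry.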