PHP - Check if a URL is valid or not

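I am checking URLs and want to return "valid" if the URL's status code is 200 and "invalid" if it is 404.

The URLs are links that redirect to some other page (URL), and I need to check the status code of that final page to decide whether the link is valid or invalid.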

<?php

// From URL to get redirected URL
$url = 'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625';
  
// Initialize a CURL session.
$ch = curl_init();
  
// Grab URL and pass it to the variable.
curl_setopt($ch, CURLOPT_URL, $url);
  
// Catch output (do NOT print!)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  
// Return follow location true
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$html = curl_exec($ch);
  
// Getinfo or redirected URL from effective URL
$redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  
// Close handle
curl_close($ch);
echo "Original URL:   " . $url . "<br/><br/>";
echo "Redirected URL: " . $redirectedUrl . "<br/>";

 function is_url_valid($url) {
  $handle = curl_init($url);
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($handle, CURLOPT_NOBODY, true);
  curl_exec($handle);
 
  $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
  curl_close($handle);
 
  if ($httpCode == 200) {
    return 'valid link';
  }
  else {
    return 'invalid link';
  }
}

// Check the status of the redirected URL.
echo "<br/>".is_url_valid($redirectedUrl)."<br/>";

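As you can see, the link above has status 400, but the script still shows "valid". I am using the code above. Any ideas or corrections to make it work as expected? It seems the site has more than one redirect URL and the script only checks one of them, which is why it shows "valid". Is there a solution?

This is the link I am checking

Problem -

For example, if I check this link https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518 then in the browser it ends up on a "404", but the script's output says "200".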

I use the get_headers() function for this. If I find a 2xx status in the array, the URL is OK.

function urlExists($url){
  $headers = @get_headers($url);
  if($headers === false) return false;
  return preg_grep('~^HTTP/\d+\.\d+\s+2\d{2}~',$headers) ? true : false;
}
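
One thing to keep in mind with get_headers(): as far as I know it follows redirects by default, so the returned array can contain one status line per hop. A minimal sketch, assuming you only care about the status of the final response. lastStatusCode() is a hypothetical helper (not part of PHP) and the sample headers are made up for illustration.

```php
<?php
// Hypothetical helper: return the status code of the LAST HTTP status line
// in a get_headers()-style array, or null if no status line is present.
function lastStatusCode(array $headers): ?int {
    $code = null;
    foreach ($headers as $line) {
        // Each redirect hop contributes its own "HTTP/x.y NNN ..." line;
        // later matches overwrite earlier ones, so the final hop wins.
        if (is_string($line) && preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Made-up sample: a redirect followed by a 404 on the final page.
$sample = [
    'HTTP/1.1 302 Found',
    'Location: https://example.org/final',
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html',
];
var_dump(lastStatusCode($sample)); // int(404)
```

With a real URL you would pass the result of @get_headers($url) after checking that it is not false.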

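The code below works fine, but when I put the URLs into an array and test the same functions, it does not give the correct results. Any idea why? Also, if anybody wants to update the answer to make it dynamic (i.e. given an array of URLs, it should check multiple URLs in one go), please do.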

  <?php
    
    // URL to check
    $url = 'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518';
      
    $ch = curl_init(); // Initialize a CURL session.
    curl_setopt($ch, CURLOPT_URL, $url); // Grab URL and pass it to the variable.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Catch output (do NOT print!)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Return follow location true
    $html = curl_exec($ch);
    $redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // Getinfo or redirected URL from effective URL
    curl_close($ch); // Close handle
    
    $get_final_url = get_final_url($redirectedUrl);
    if($get_final_url){
        echo is_url_valid($get_final_url);
    }else{
        echo $redirectedUrl ? is_url_valid($redirectedUrl) : is_url_valid($url);
    }
    
    function is_url_valid($url) {
      $handle = curl_init($url);
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($handle, CURLOPT_NOBODY, true);
      curl_exec($handle);
     
      $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
      curl_close($handle);
      echo $httpCode;
      if ($httpCode == 200) {
        return '<b> Valid link </b>';
      }
      else {
        return '<b> Invalid link </b>';
      }
    }
    
    function get_final_url($url) {
        $ch = curl_init();
        if (!$ch) {
            return false;
        }
        curl_setopt($ch, CURLOPT_URL,            $url);
        curl_setopt($ch, CURLOPT_HEADER,         1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT,        30);
        $ret  = curl_exec($ch);
        $info = curl_getinfo($ch);
        curl_close($ch);

        // No response or no HTTP status: nothing to inspect.
        if (empty($ret) || empty($info['http_code'])) {
            return false;
        }
        // Look for a JavaScript redirect target such as ...('https:...')
        if (!preg_match('#(https:.*?)\'\)#', $ret, $match)) {
            return false;
        }
        return stripslashes($match[1]);
    }

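Look, the problem here is that you want to follow JAVASCRIPT redirects. The URL you are complaining about, https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518, does redirect to a URL that responds with HTTP 200 OK, and that page contains this JavaScript: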

<script LANGUAGE="JavaScript1.2">
                window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
                </script>

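So your browser, which understands JavaScript, follows the JavaScript redirect, and that JS redirect leads to a 404 page. Unfortunately there is no good way to do this from PHP; your best bet is probably a headless web browser, e.g. PhantomJS, puppeteer, Selenium or something similar.

You can, however, hack together a regex search for the JavaScript redirect and hope for the best, for example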

<?php
function is_url_valid(string $url):bool{
    if(0!==strncasecmp($url,"http",strlen("http"))){
        // file:///etc/passwd and stuff like that aren't considered valid urls right?
        return false;
    }
    $ch=curl_init();
    if(!curl_setopt_array($ch,array(
        CURLOPT_URL=>$url,
        CURLOPT_FOLLOWLOCATION=>1,
        CURLOPT_RETURNTRANSFER=>1
    ))){
        // best guess: the url is so malformed that even CURLOPT_URL didn't accept it.
        return false;
    }
    $resp= curl_exec($ch);
    if(false===$resp){
        return false;
    }
    if(curl_getinfo($ch,CURLINFO_RESPONSE_CODE) != 200){
        // only HTTP 200 OK is accepted
        return false;
    }
    // attempt to detect javascript redirects... sigh
    // window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
    $rex = '/location\.replace\s*\(\s*(?<redirect>(?:\'|\")[\s\S]*?(?:\'|\"))/';
    if(!preg_match($rex, $resp, $matches)){
        // no javascript redirects detected..
        return true;
    }else{
        // javascript redirect detected..
        $url = trim($matches["redirect"]);
        // javascript allows both ' and " for strings, but json only allows " for strings
        $url = str_replace("'",'"',$url);
        $url = json_decode($url, true,512,JSON_THROW_ON_ERROR); // we extracted it from javascript, need json decoding.. (well, strictly speaking, it needs javascript decoding, but json decoding is probably sufficient, and we only have a json decoder nearby)
        curl_close($ch);
        return is_url_valid($url);
    }
}
var_dump(

    is_url_valid('https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518'),
    is_url_valid('http://example.org'),
    is_url_valid('http://example12k34jr43r5ehjegeesfmwefdc.org'),
    
);

But this is a dodgy, hacky solution, to put it mildly.
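
To make the regex part easier to try in isolation, here is a rough sketch of just the extraction step, run against the script snippet quoted earlier. extractJsRedirect() is a hypothetical helper, and unlike the code above it does not use json_decode(): it only undoes the "\/" escaping, which is an assumption about how these pages encode the URL. It will miss redirects written any other way.

```php
<?php
// Hypothetical helper: pull the target of a location.replace('...') call
// out of an HTML string. Returns null when no such call is found.
function extractJsRedirect(string $html): ?string {
    // Capture the opening quote, then everything up to the same quote.
    $rex = '/location\.replace\s*\(\s*([\'"])(.*?)\1/s';
    if (!preg_match($rex, $html, $m)) {
        return null;
    }
    // Undo JavaScript's "\/" escaping and drop stray whitespace.
    return trim(str_replace('\\/', '/', $m[2]));
}

$html = <<<'HTML'
<script LANGUAGE="JavaScript1.2">
window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
</script>
HTML;

var_dump(extractJsRedirect($html));
```

The returned URL could then be passed to a status check like is_url_valid().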

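Here is my take on this problem. Basically, the gist is:

  1. You do not need to make more than one request. CURLOPT_FOLLOWLOCATION does all the work for you, and if there are one or more redirects, the HTTP response code you get at the end is the one from the final request.
  2. Since you are using CURLOPT_NOBODY, the request uses the HEAD method and does not return a body. CURLOPT_RETURNTRANSFER is therefore useless.
  3. I took the liberty of using my own coding style (no offense intended).
  4. Because I was running the code from a Scratch file in PhpStorm, I added some PHP_EOL line breaks to format the output. Feel free to remove them.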

...

<?php

$linksToCheck = [
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=547531.5112&type=15&murl=https%3A%2F%2Fwww.peopletree.co.uk%2Fwomen%2Fdresses%2Fanna-checked-dress',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.2335&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fagnetha-black-floral-print-bamboo-dress-midnight-navy%2F%2392%3D1390%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.752&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fbernice-floral-tunic-dress%2F%2392%3D1273%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.6863&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fjosefa-smock-shift-dress-in-midnight-navy-hemp%2F%2392%3D1390%26142%3D208',
    'https://www.shareasale.com/m-pr.cfm?merchantID=16570&userID=1860618&productID=546729471',
    'https://www.shareasale.com/m-pr.cfm?merchantID=53661&userID=1860618&productID=680698793',
    'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518',
    'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625',
];

function isValidUrl($url) {
    echo "Original URL:   " . $url . "<br/>\n";

    $handle = curl_init($url);

    // Follow any redirection.
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);

    // Use a HEAD request and do not return a body.
    curl_setopt($handle, CURLOPT_NOBODY, true);

    // Execute the request.
    curl_exec($handle);

    // Get the effective URL.
    $effectiveUrl = curl_getinfo($handle, CURLINFO_EFFECTIVE_URL);
    echo "Effective URL:   " . $effectiveUrl . "<br/><br/>";

    $httpResponseCode = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

    // Close this request.
    curl_close($handle);

    if ($httpResponseCode == 200) {
        return '✅';
    }
    else {
        return '❌';
    }
}

foreach ($linksToCheck as $linkToCheck) {
    echo PHP_EOL . "Result: " . isValidUrl($linkToCheck) . PHP_EOL . PHP_EOL;
}
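
If you would rather collect the results than echo them, the loop can be pulled into a small function. This is only a sketch: checkUrls() and $curlStatus are hypothetical names, the 15-second timeout is an assumed value, and the status lookup is passed in as a callable so the loop can be tried without network access.

```php
<?php
// Hypothetical helper: map each URL to true/false using an injected
// status-fetching callable.
function checkUrls(array $urls, callable $fetchStatus): array {
    $results = [];
    foreach ($urls as $url) {
        // Same rule as isValidUrl(): only a final 200 counts as valid.
        $results[$url] = ((int) $fetchStatus($url)) === 200;
    }
    return $results;
}

// A cURL-backed fetcher using the same options as isValidUrl().
$curlStatus = function (string $url): int {
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($handle, CURLOPT_NOBODY, true);
    curl_setopt($handle, CURLOPT_TIMEOUT, 15); // assumed timeout, adjust as needed
    curl_exec($handle);
    $code = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return $code;
};

// Usage: $results = checkUrls($linksToCheck, $curlStatus);
```

Note that this still cannot see JavaScript redirects, so the shareasale links behave the same as before.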

Note: we use CURLOPT_NOBODY to just check the connection instead of fetching the whole body.

  $url = "Your URL";
  $curl = curl_init($url);
  // HEAD request: fetch the headers only, not the body.
  curl_setopt($curl, CURLOPT_NOBODY, true);
  $result = curl_exec($curl);
  if ($result !== false) {
      $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
      if ($statusCode == 404) {
          echo "URL does not exist";
      } else {
          echo "URL exists";
      }
  } else {
      // The request itself failed (DNS error, timeout, ...).
      echo "URL does not exist";
  }
  curl_close($curl);