Modified PHP get_meta_tags not working for some URLs

Question

I'm trying to use the code from the user contributed notes on php.net for the get_meta_tags function. From what I can tell, if the meta tag is formatted as <meta content="foo" name="bar" />, the code will miss it. Currently, only tags formatted as <meta name="bar" content="foo"/> work. I'm not great with regex and tried unsuccessfully to fix it. Here is an example of a URL that seems to get past the regex. Apologies in advance: my question isn't strictly about the get_meta_tags function, but it seems related to some other issues people have been having with that function.

It looks like the problem is somewhere around here:

preg_match_all('/<[\s]*meta[\s]*(name|property)="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

It may need to be something like this:

preg_match_all('/<[\s]*meta[\s]*(name|property|content)="?' . '([^>"]*)"?[\s]*' . '(content|name)="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);

But again, I'm really bad with regex. Any ideas?
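
For reference, here is a minimal reproduction sketch of the behaviour described above. It uses the two patterns from the question; the one-line inputs $nameFirst and $contentFirst are made up for illustration. It suggests that the proposed alternation does match the reversed order, but the key and the value then land in different groups, so the first captured attribute name has to be checked before pairing them.

```php
<?php
// Pattern from the php.net note (expects name/property before content).
$original = '/<[\s]*meta[\s]*(name|property)="?([^>"]*)"?[\s]*content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si';
// Proposed variant from the question (accepts either order).
$proposed = '/<[\s]*meta[\s]*(name|property|content)="?([^>"]*)"?[\s]*(content|name)="?([^>"]*)"?[\s]*[\/]?[\s]*>/si';

// Hypothetical sample tags, one per attribute order.
$nameFirst    = '<meta name="bar" content="foo"/>';
$contentFirst = '<meta content="foo" name="bar" />';

$hitsNameFirst    = preg_match_all($original, $nameFirst, $m1);    // 1
$hitsContentFirst = preg_match_all($original, $contentFirst, $m2); // 0, the tag is missed

// With the proposed pattern, group 1 says which attribute came first,
// so key and value must be swapped when it is "content".
preg_match_all($proposed, $contentFirst, $sets, PREG_SET_ORDER);
$pairs = [];
foreach ($sets as $m) {
    if (strtolower($m[1]) === 'content') {
        $pairs[$m[4]] = $m[2]; // content first: group 4 is the key
    } else {
        $pairs[$m[2]] = $m[4]; // name/property first: group 2 is the key
    }
}
print_r($pairs); // Array ( [bar] => foo )
```

Both patterns still require the two attributes to be adjacent, so a tag with any other attribute between them would presumably be skipped.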

Answer 1

One idea is to capture the meta name/property inside a lookahead, so that matching no longer depends on the order of the attributes:

function extract_meta_tags($source)
{
  $pattern = '
  ~<\s*meta\s

  # using lookahead to capture type to group 1
    (?=[^>]*?
    \b(?:name|property|itemprop|http-equiv)\s*=\s*
    (?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|
    ([^"\'>]*?)(?=\s*/?\s*>|\s\w+\s*=))
  )

  # capture content to group 2
  [^>]*?\bcontent\s*=\s*
    (?|"\s*([^"]*?)\s*"|\'\s*([^\']*?)\s*\'|
    ([^"\'>]*?)(?=\s*/?\s*>|\s\w+\s*=))
  [^>]*>

  ~ix';

  if(preg_match_all($pattern, $source, $out))
    return array_combine(array_map('strtolower', $out[1]), $out[2]);
  return array();
}
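
To see the lookahead idea in isolation, here is a reduced sketch. It is not the full pattern: it only handles the name attribute and double-quoted values, and the sample string $html is made up for illustration. The lookahead captures the name without consuming any text, so the content attribute can then be matched wherever it sits in the tag.

```php
<?php
// Reduced pattern: the lookahead (?=...) finds name="..." anywhere in the tag
// and captures it as group 1; the rest of the pattern captures content as group 2.
$pattern = '~<meta\s(?=[^>]*?\bname\s*=\s*"([^"]*)")[^>]*?\bcontent\s*=\s*"([^"]*)"[^>]*>~i';

// Hypothetical input with both attribute orders.
$html = '<meta name="bar" content="foo"><meta content="baz" name="qux" />';

preg_match_all($pattern, $html, $out);
$tags = array_combine($out[1], $out[2]);

print_r($tags); // Array ( [bar] => foo [qux] => baz )
```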

See the test of the extract_meta_tags pattern at regex101. It uses the branch reset feature to extract the values for the different quoting styles.
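
A small sketch of what branch reset does, assuming a simplified pattern that only looks at a content attribute (the $samples strings are made up for illustration). Inside (?|...) every alternative numbers its capture group from the same index, so the value is always in group 1 whichever quoting style matched. With a plain non-capturing group the same value would end up in group 1, 2 or 3.

```php
<?php
// Branch reset: all three alternatives write to group 1.
$reset = '~\bcontent\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))~i';
// Same alternatives in a plain group: groups 1, 2 and 3.
$plain = '~\bcontent\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))~i';

// Hypothetical attribute snippets: double-quoted, single-quoted, unquoted.
$samples = ['content="a b"', "content='c d'", 'content=e'];

$values = [];
foreach ($samples as $sample) {
    if (preg_match($reset, $sample, $m)) {
        $values[] = $m[1]; // always group 1
    }
}
print_r($values); // Array ( [0] => a b [1] => c d [2] => e )

// Without branch reset the unquoted value is in group 3; groups 1 and 2 are empty.
preg_match($plain, 'content=e', $p);
print_r($p);
```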

print_r(extract_meta_tags($str));

Try it with some different data at eval.in.

Use the function on the <head> section of the HTML. To get the page source and extract the head:

1.) Get the source code: use cURL, file_get_contents or fsockopen.

2.) Extract the <head>: use DOM or a regex like this: (?is)<head\b[^>]*>(.*?)</head>

3.) Extract the meta tags from the <head>: use the provided regex or try a parser.
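
The three steps above could be sketched roughly as follows, taking the parser route for step 3. This is only an illustration under some assumptions: it needs PHP's DOM extension, the helper name meta_tags_from_head is hypothetical, and a literal string stands in for the downloaded page so that the example runs without network access.

```php
<?php
// Step 1: get the source. A literal page stands in for the download here;
// for a real page this could be file_get_contents($url) or cURL.
$source = <<<'PAGE'
<!DOCTYPE html>
<html>
<head>
  <title>Demo</title>
  <meta name="description" content="first">
  <meta content="second" property="og:title" />
</head>
<body><meta name="ignored" content="in body"></body>
</html>
PAGE;

// Step 2: extract the <head> with the regex given in the list.
$head = preg_match('~(?is)<head\b[^>]*>(.*?)</head>~', $source, $m) ? $m[1] : '';

// Step 3 (parser variant): read the meta elements with DOMDocument.
// meta_tags_from_head is a hypothetical helper, not part of PHP.
function meta_tags_from_head(string $head): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML('<html><head>' . $head . '</head></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Attribute order does not matter to a parser.
        foreach (['name', 'property', 'itemprop', 'http-equiv'] as $attr) {
            if ($meta->hasAttribute($attr) && $meta->hasAttribute('content')) {
                $tags[strtolower($meta->getAttribute($attr))] = trim($meta->getAttribute('content'));
                break;
            }
        }
    }
    return $tags;
}

$tags = meta_tags_from_head($head);
print_r($tags); // Array ( [description] => first [og:title] => second )
```

Because only the <head> is passed on, the meta element in the body of the sample page is not picked up.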

php

regex

meta-tags