
Regular expression is not extracting meta tag data

I have this in a PHP variable:

$data="<meta charset='UTF-8'>
<meta name='keywords' content='your, tags'>
<meta name='description' content='150 words'>
<meta name='subject' content='your website's subject'>
<meta name='copyright' content='company name'>
<meta name='language' content='ES'>
<meta name='robots' content='index,follow'>
<meta name='revised' content='Sunday, July 18th, 2010, 5:15 pm'>
<meta name='abstract' content=''>
<meta name='topic' content=''>
<meta name='summary' content=''>
<meta name='Classification' content='Business'>
<meta name='author' content='name, email@hotmail.com'>
<meta name='designer' content=''>
<meta name='reply-to' content='email@hotmail.com'>
<meta name='owner' content=''>
<meta name='url' content='http://www.websiteaddrress.com'>
<meta name='identifier-URL' content='http://www.websiteaddress.com'>
<meta name='directory' content='submission'>
<meta name='pagename' content='jQuery Tools, Tutorials and Resources - O'Reilly Media'>
<meta name='category' content=''>
<meta name='coverage' content='Worldwide'>
<meta name='distribution' content='Global'>
<meta name='rating' content='General'>
<meta name='revisit-after' content='7 days'>
<meta name='subtitle' content='This is my subtitle'>
<meta name='target' content='all'>
<meta name='HandheldFriendly' content='True'>
<meta name='MobileOptimized' content='320'>
<meta name='date' content='Sep. 27, 2010'>
<meta name='search_date' content='2010-09-27'>
<meta name='DC.title' content='Unstoppable Robot Ninja'>
<meta name='ResourceLoaderDynamicStyles' content=''>
<meta name='medium' content='blog'>
<meta name='syndication-source' content='https://mashable.com/2008/12/24/free-brand-monitoring-tools/'>
<meta name='original-source' content='https://mashable.com/2008/12/24/free-brand-monitoring-tools/'>
<meta name='verify-v1' content='dV1r/ZJJdDEI++fKJ6iDEl6o+TMNtSu0kv18ONeqM0I='>
<meta name='y_key' content='1e39c508e0d87750'>
<meta name='pageKey' content='guest-home'>
<meta itemprop='name' content='jQTouch'>
<meta http-equiv='Expires' content='0'>
<meta http-equiv='Pragma' content='no-cache'>
<meta http-equiv='Cache-Control' content='no-cache'>
<meta http-equiv='imagetoolbar' content='no'>
<meta http-equiv='x-dns-prefetch-control' content='off'>";

I want to extract the values of the listed meta tags, both the name meta tags and the http-equiv meta tags. This is what I tried:


// explode the string by newline
$parts = explode("\n", $data);

// loop through each meta tag line
foreach($parts as $part){

  // match inside the name attribute and the content attribute
  preg_match("/<meta name=\"(.*)\" content=\"(.*)\" \/>/i",$part,$matches);

  // prints an empty "Array()" wrapped in <pre> tags
  print "<pre>".print_r($matches,true)."</pre>";
}


Your attributes use single quotes, not double quotes, and the tags end with > rather than />, with no space before it:

preg_match("/<meta name='([^']*)' content='([^']*)'\s?\/?>/i", $part, $matches);


[^']* # match everything up to the next '
\s?   # an optional whitespace character (\s)
\/?   # an optional slash (/)
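To see how that pattern behaves inside the loop from the question, here is a minimal sketch. The three-line `$data` sample and the `$extracted` array are assumptions made for illustration; the pattern itself is the one shown above.

```php
<?php
// Hypothetical sample: three lines copied from the question's $data.
$data = "<meta name='keywords' content='your, tags'>
<meta name='subject' content='your website's subject'>
<meta http-equiv='Pragma' content='no-cache'>";

$extracted = [];
foreach (explode("\n", $data) as $part) {
    // Only store a pair when the line actually matches.
    if (preg_match("/<meta name='([^']*)' content='([^']*)'\s?\/?>/i", $part, $matches)) {
        $extracted[$matches[1]] = $matches[2];
    }
}

print_r($extracted);
```

Only the keywords line is captured here. The subject line is skipped because `[^']*` stops at the apostrophe inside the value, and the http-equiv line is skipped because the pattern only looks for a name attribute.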

Here is a version that also handles double quotes and multiple spaces:
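A minimal sketch of one such pattern follows. It is an assumption, not necessarily the exact pattern the answer used. It accepts either quote style through a backreference, tolerates extra whitespace, and also accepts http-equiv in place of name. The `$lines` sample and the `$found` array are hypothetical.

```php
<?php
// Hypothetical sample lines mixing quote styles and spacing.
$lines = [
    "<meta name='keywords' content='your, tags'>",
    '<meta name="description"   content="150 words" />',
    "<meta http-equiv='Pragma' content='no-cache'>",
];

// Single-quoted PHP string so that \1 and \3 reach PCRE as backreferences.
// Groups 1 and 3 capture the opening quote; groups 2 and 4 capture the values.
$pattern = '~<meta\s+(?:name|http-equiv)\s*=\s*(["\'])(.*?)\1\s+content\s*=\s*(["\'])(.*?)\3\s*/?>~i';

$found = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        $found[$m[2]] = $m[4];
    }
}

print_r($found);
```

The pattern is written as a single-quoted PHP string on purpose: in a double-quoted string, `\1` would be read as an octal escape before the regex engine ever sees it.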



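However, it is always better to use a DOM parser to inspect HTML elements.

Under normal circumstances, the best and most reliable advice is to parse your HTML with DomDocument or another dedicated HTML parsing tool.

This is what a solution using DomDocument and XPath would look like: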


$result = [];
$dom = new DOMDocument;
libxml_use_internal_errors(true);  // do not display warnings on erroneous lines
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate("//meta[(@name or @http-equiv) and @content]") as $node) {  // target <meta> tags that have a content attribute plus either a name or an http-equiv attribute
    if ($value = $node->getAttribute('name')) {
        $attr = 'name';
    } else {
        $attr = 'http-equiv';
        $value = $node->getAttribute('http-equiv');
    }
    $result[] = [$attr => $value, 'content' => $node->getAttribute('content')];
}
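The parser approach has a catch with this particular input. The sketch below feeds DomDocument a single line from the question whose value contains an unescaped apostrophe; the one-line `$html` sample and the `$content` variable are assumptions made for illustration.

```php
<?php
// Hypothetical one-line sample copied from the question's data.
$html = "<meta name='subject' content='your website's subject'>";

$dom = new DOMDocument;
libxml_use_internal_errors(true);  // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$node = $xpath->query("//meta[@name='subject']")->item(0);
$content = $node->getAttribute('content');

var_dump($content);
```

The parser treats the first apostrophe after the opening quote as the end of the attribute, so the value that comes back should be shorter than the intended "your website's subject".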



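However, some of your content values contain unescaped apostrophes inside single-quoted attributes (for example 'your website's subject'), and an HTML parser cuts such values off at the first apostrophe. The most direct workaround for capturing the target data without truncating these values is to use a greedy regex pattern: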


preg_match_all("~<meta (?:name|http-equiv)='(.*)' content='(.*)'>~", $html, $matches, PREG_SET_ORDER);

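This works in your case because your input data appears to follow a strict format. The target lines have exactly the same two target attributes, in the same order, with no extra characters to accommodate. The greedy * quantifier matches zero or more characters (trying to match as many as possible, apostrophes included) while still satisfying the rest of the pattern. This pattern will not truncate your attribute values. I am using PREG_SET_ORDER to keep each meta tag's data grouped together; you do not have to use it in your actual project. Here is a demo of the regex method and a commented-out DomDocument method that demonstrates the quoting issue.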
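As a quick check of the greedy pattern, the following sketch runs it on three lines from the question, two of which contain apostrophes inside the value. The shortened `$html` sample and the `$pairs` array are assumptions made for illustration.

```php
<?php
// Hypothetical three-line sample copied from the question's data.
$html = "<meta name='subject' content='your website's subject'>
<meta name='pagename' content='jQuery Tools, Tutorials and Resources - O'Reilly Media'>
<meta http-equiv='Pragma' content='no-cache'>";

preg_match_all("~<meta (?:name|http-equiv)='(.*)' content='(.*)'>~", $html, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each $m holds one tag: [full match, name or http-equiv, content].
$pairs = [];
foreach ($matches as $m) {
    $pairs[$m[1]] = $m[2];
}

print_r($pairs);
```

Because . does not match a newline by default, each greedy group stays within its own line, which is why one call can process the whole multi-line string.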