How to replace specific text with hyperlinks without modifying pre-existing <img> and <a> tags?

Question

This is the error I am trying to fix:

<img class="lazy_responsive" title="<a href='kathryn-kuhlman-language-en-topics-718-page-1' title='Kathryn Kuhlman'>Kathryn Kuhlman</a> - iUseFaith.com" src="ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="<a href='kathryn-kuhlman-language-en-topics-718-page-1' title='Kathryn Kuhlman'>Kathryn Kuhlman</a> - iUseFaith.com" width="1600" height="517">

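If you look closely at the code above, you will see that the text in the alt and title attributes has been replaced with a link, because the keyword appears in that text. As a result, my image shows a tooltip containing link markup instead of just the name.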

Problem: I have an array of keywords, where each keyword has its own URL that will be used as its link, like this:

$keywords["Kathryn Kuhlman"] = "https://www.iusefaith.com/en-354";
$keywords["Max KANTCHEDE"] = "https://www.iusefaith.com/MaxKANTCHEDE";

I have a text containing images and links, in which these keywords can be found.

$text='Meet God\'s General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE
';

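I want to replace each keyword with a full link that has a title, without replacing the contents of the href, alt, or title attributes already in the text. I did this: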

$lien_existants = array();

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

if(preg_match_all("/$regexp/siU", $text, $matches, PREG_SET_ORDER)) 
{
    foreach($matches as $match) 
    {
        $lien_actuels_existant = filter_var($match[3], FILTER_SANITIZE_STRING);
        $lien_existants [] = trim($lien_actuels_existant);
          
        // $match[2] = link address
        // $match[3] = link text
        
        echo $match[2], '', $match[3], '<br>';
    }
}   

foreach(@$keywords as $name => $value) 
{
    if(!in_array($name, $lien_existants)&&!preg_match("/'/i", $name)&&!preg_match('/"/i', $name))
    {
        $text =  trim(preg_replace('~(\b'. $name.'\b)~ui', "<a href='$value' title='$name'>$1</a>", $text));
    }
    else
    {
        $name = addslashes($name);
        $text =  trim(preg_replace('~(\b'. $name.'\b)~ui', "<a href='$value' title='$name'>$1</a>", $text));
    }
    ######################################### 
}

This replaces the words with links, but it also replaces them inside the alt and title attributes of images.

How can I prevent it from replacing the text inside alt, title, and href?

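Please note that I have already tried every other solution I found on S.O., so if you think one of them works, please apply it to my code above and show me how it should be done. If I knew how to make it work, I would not be asking here.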

Answer 1

Regular expressions are not the best way to process HTML content.

Here is a solution using DOM manipulation. The code should be self-explanatory and is commented.

The idea is to find all text nodes that are not children of a link or an image, and to search/replace the terms you want inside those nodes only.
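As a small standalone illustration of that selection step, the sketch below runs the XPath query on a made-up fragment and lists the text it selects. The sample markup, the helper name selectedText, and the second ancestor-based query are assumptions added for illustration. The point is that the parent-based query only checks the immediate parent, so text nested deeper inside a link (here inside <em>) is still selected.

```php
<?php
// Hypothetical markup, used only to show which text nodes each query selects.
$sample = '<html><body>Intro text <img alt="in attribute" src="x.jpg"> '
        . '<a href="#">in link <em>nested</em></a> <b>in bold</b></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// Collect the trimmed, non-empty text of every node matched by a query.
function selectedText(DOMXPath $xpath, string $query): array
{
    $found = [];
    foreach ($xpath->query($query) as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') {
            $found[] = $value;
        }
    }
    return $found;
}

// Parent-based test: skips "in link", but still picks up "nested".
$byParent = selectedText($xpath, '//*[not(self::img or self::a)]/text()');

// Ancestor-based test (an assumed variation): skips everything inside <a>.
$byAncestor = selectedText($xpath, '//text()[not(ancestor::a)]');

print_r($byParent);
print_r($byAncestor);
```

Attribute values such as alt never show up in either list, because attributes are not text nodes. If links in your content can contain nested markup, the ancestor-based form may be the safer choice.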

<?php
    
    $keywords["Kathryn Kuhlman"] = "https://www.iusefaith.com/en-354";
    $keywords["Max KANTCHEDE"] = "https://www.iusefaith.com/MaxKANTCHEDE";
    
    $text='Meet God\'s General Kathryn Kuhlman. <br>
    <img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
    <br>
    Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
    <br>
    Max KANTCHEDE
    ';
    
    
    // Format the replacement
    foreach($keywords as $name => &$value) {
        $value = '<a href="'.$value.'" title="'.$name.'">'.$name.'</a>';
    }
    unset($value); // break the reference left behind by the by-reference loop
    
    // Load a DomDocument with our html
    $doc = new DOMDocument();
    $doc->loadHTML('<html><body>' . $text . '</body></html>');
    
    // Use XPath to find all text nodes whose parent is not an img or a element
    $xpath = new DOMXPath($doc);
    $textnodes = $xpath->query('//*[not(self::img or self::a)]/text()');
    
    // For each text node replace words found by the link
    foreach($textnodes as $textnode) {
        $html = str_replace(array_keys($keywords), array_values($keywords), $textnode->nodeValue, $count);
        if ($count) {
            $newelement = $doc->createDocumentFragment();
            $newelement->appendXML($html);
            $textnode->parentNode->replaceChild($newelement, $textnode);
        }
    }
    
    // Retrieve body html
    $body_element = $doc->getElementsByTagName('body');
    $body = $doc->savehtml($body_element->item(0));
    
    // Remove wrapping <body></body>
    echo substr($body, 6, strlen($body)-13);

You can use str_ireplace instead of str_replace for a case-insensitive search.
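One consequence worth knowing before switching: str_ireplace inserts the replacement string exactly as written, so the link text takes the casing of the array key, not the casing found in the text. A minimal sketch, assuming a made-up lowercase input ($plain and $links are illustrative names):

```php
<?php
// Hypothetical input: the keyword appears in lowercase in the text.
$keywords = ['Kathryn Kuhlman' => 'https://www.iusefaith.com/en-354'];
$plain = 'Meet kathryn kuhlman today.';

// Build the replacement links the same way as in the answer above.
$links = [];
foreach ($keywords as $name => $url) {
    $links[$name] = '<a href="' . $url . '" title="' . $name . '">' . $name . '</a>';
}

// Case-insensitive search; the replacement text is inserted verbatim,
// so the original lowercase spelling is not preserved.
$result = str_ireplace(array_keys($links), array_values($links), $plain, $count);

echo $result, PHP_EOL;
```

If the original casing has to survive, a regex with a capture group is one way to reuse the matched text.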

Answer 2

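This can be done with regular expressions by temporarily prepending a unique "marker string" to every keyword that you do not want replaced. See this regex101 demo and the following code: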

// Define a marker string - could be anything that is very unlikely to appear in the
// text. (But don't include any characters that would need to be escaped in a regex).
$marker = '¬¦@#~';

// Construct regex alternation syntax for all the keywords.
// E.g: (Kathryn Kuhlman|Max KANTCHEDE|Another one)
$alt_keywords = '('.join('|', array_keys($keywords)).')';

// Double quotes: Prepend marker to keywords in href="...", alt="..." or title="..."
$text = preg_replace(
    '/((?:href|alt|title)\s*=\s*"[^"]*)'.$alt_keywords.'/',
    '${1}'.$marker.'${2}',
    $text);

// Single quotes: Prepend marker to keywords in href='...', alt='...' or title='...'
$text = preg_replace(
    "/((?:href|alt|title)\s*=\s*'[^']*)$alt_keywords/",
    '${1}'.$marker.'${2}',
    $text);

// Optional step - not explicitly requested in the question but seems necessary:
// Prepend marker to keywords found within anchor tags / end tags: <a>...</a>
$text = preg_replace(
    "/(<a(?:\s+[^>]*)?>[^<]*)$alt_keywords([^<]*<\/a\s*>)/",
    '${1}'.$marker.'${2}${3}',
    $text);

Negative lookbehind can then be used to make replacements only where the marker text isn't present. See this regex101 demo and the following code:

foreach($keywords as $name => $url) {
  $text = preg_replace(
      "/(?<!$marker)$name/",
      "<a href=\"$url\" title=\"$name\">$name</a>",
      $text);
}

// Now clean up by removing all instances of the marker text
$text = str_replace($marker, '', $text);

Demo

This Rextester demo shows that the code above works for the sample values in the question.
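For reference, the marker steps can be collected into one function. This is only a sketch under stated assumptions. The function name linkKeywordsWithMarker and the shortened sample text are hypothetical. It adds preg_quote() so that keywords are matched literally, and it performs the final replacement in a single preg_replace_callback pass so that a freshly inserted link is never rescanned for another keyword.

```php
<?php
// Sketch of the marker technique as one function (hypothetical name).
function linkKeywordsWithMarker(string $text, array $keywords, string $marker = '¬¦@#~'): string
{
    // Escape each keyword so it is matched literally inside the patterns.
    $quoted = array_map(function ($k) { return preg_quote($k, '/'); }, array_keys($keywords));
    $alt = '(' . implode('|', $quoted) . ')';

    // 1. Protect keywords inside href/alt/title values (double, then single quotes).
    foreach (['"', "'"] as $q) {
        $text = preg_replace(
            '/((?:href|alt|title)\s*=\s*' . $q . '[^' . $q . ']*)' . $alt . '/',
            '${1}' . $marker . '${2}',
            $text
        );
    }

    // 2. Protect keywords that are already the text of an <a> element.
    $text = preg_replace(
        '/(<a(?:\s+[^>]*)?>[^<]*)' . $alt . '([^<]*<\/a\s*>)/',
        '${1}' . $marker . '${2}${3}',
        $text
    );

    // 3. Link every keyword that is not preceded by the marker, in one pass.
    $text = preg_replace_callback(
        '/(?<!' . preg_quote($marker, '/') . ')' . $alt . '/',
        function (array $m) use ($keywords) {
            $name = $m[1];
            return '<a href="' . $keywords[$name] . '" title="' . $name . '">' . $name . '</a>';
        },
        $text
    );

    // 4. Remove the markers again.
    return str_replace($marker, '', $text);
}

// Usage with a shortened, made-up version of the question's text.
$keywords = [
    'Kathryn Kuhlman' => 'https://www.iusefaith.com/en-354',
    'Max KANTCHEDE'   => 'https://www.iusefaith.com/MaxKANTCHEDE',
];
$text = 'Meet Kathryn Kuhlman. <img title="Kathryn Kuhlman - iUseFaith.com" src="k.jpg" '
      . 'alt="Kathryn Kuhlman - iUseFaith.com" /> Follow <a href="https://www.iusefaith.com/en-354" '
      . 'title="Kathryn Kuhlman">Kathryn Kuhlman</a> Max KANTCHEDE';

$result = linkKeywordsWithMarker($text, $keywords);
echo $result, PHP_EOL;
```

One limitation carried over from the patterns above: because [^"]* is greedy, each pass marks only the last keyword within a given attribute value, so an attribute that contains several keywords would need the protection step to be repeated.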

Answer 3

I think @Jiwoks's answer is on the right track in using DOM parsing calls to isolate the qualifying text nodes.

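While his answer works on the OP's sample data, I was dissatisfied to find that his solution fails when a single text node contains more than one string to replace.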

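I crafted my own solution with the goals of accommodating case-insensitive matching, word boundaries, multiple replacements within one text node, and the insertion of fully fledged nodes (not merely new strings that look like child nodes).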

Code: (Demo #1 with 2 replacements in a text node) (Demo #2: with OP's text)
(After receiving fuller, more realistic text from the OP: Demo #3 without trimming saveHTML())

$html = <<<HTML
Meet God's General Kathryn Kuhlman. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517" />
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
Max KANTCHEDE & Kathryn Kuhlman
HTML;

$keywords = [
    'Kathryn Kuhlman' => 'https://www.example.com/en-354',
    'Max KANTCHEDE' => 'https://www.example.com/MaxKANTCHEDE',
    'eneral' => 'https://www.example.com/this-is-not-used',
];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

$lookup = [];
$regexNeedles = [];
foreach ($keywords as $name => $link) {
    $lookup[strtolower($name)] = $link;
    $regexNeedles[] = preg_quote($name, '~');
}
$pattern = '~\b(' . implode('|', $regexNeedles) . ')\b~i' ;

foreach($xpath->query('//*[not(self::img or self::a)]/text()') as $textNode) {
    $newNodes = [];
    $hasReplacement = false;
    foreach (preg_split($pattern, $textNode->nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE) as $fragment) {
        $fragmentLower = strtolower($fragment);
        if (isset($lookup[$fragmentLower])) {
            $hasReplacement = true;
            $a = $dom->createElement('a');
            $a->setAttribute('href', $lookup[$fragmentLower]);
            $a->setAttribute('title', $fragment);
            $a->nodeValue = $fragment;
            $newNodes[] = $a;
        } else {
            $newNodes[] = $dom->createTextNode($fragment);
        }
    }
    if ($hasReplacement) {
        $newFragment = $dom->createDocumentFragment();
        foreach ($newNodes as $newNode) {
            $newFragment->appendChild($newNode);
        }
        $textNode->parentNode->replaceChild($newFragment, $textNode);
    }
}
echo substr(trim($dom->saveHTML()), 3, -4);

Output:

Meet God's General <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>. <br>
<img class="lazy_responsive" title="Kathryn Kuhlman - iUseFaith.com" src="https://www.iusefaith.com/ojm_thumbnail/1000/32f808f79011a7c0bd1ffefc1365c856.jpg" alt="Kathryn Kuhlman - iUseFaith.com" width="1600" height="517">
<br>
Follow <a href="https://www.iusefaith.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>
<br>
<a href="https://www.example.com/MaxKANTCHEDE" title="Max KANTCHEDE">Max KANTCHEDE</a> &amp; <a href="https://www.example.com/en-354" title="Kathryn Kuhlman">Kathryn Kuhlman</a>

Some explanatory points:

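I am using some DOMDocument error silencing and flags because the sample input lacks a parent tag containing all of the text. (There is nothing wrong with @Jiwoks's technique; this is just a different one, so choose whichever you like.)
A lookup array with lowercased keys is declared to allow case-insensitive translation of qualifying text.
The regex pattern is built dynamically, so the keywords should be passed through preg_quote() to ensure that the pattern logic is upheld. \b is a word boundary metacharacter that prevents matching substrings inside longer words. Notice that eneral is not replaced within General in the output. The case-insensitive flag i allows greater flexibility for this application and future ones.
My XPath query is identical to @Jiwoks's; there is no reason to change it. It looks for text nodes that are not children of <img> or <a> tags.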
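To make the pattern-building point concrete, here is a minimal sketch of how preg_quote(), \b, and the i flag interact. The keyword list and the uppercase test sentence are made up for illustration.

```php
<?php
// Hypothetical keyword list: 'eneral' is a substring of "General" on purpose.
$needles = ['eneral', 'Kathryn Kuhlman'];

// Escape every keyword so regex metacharacters in it are matched literally.
$quoted = array_map(function ($n) { return preg_quote($n, '~'); }, $needles);
$pattern = '~\b(' . implode('|', $quoted) . ')\b~i';

// \b blocks the match inside "General"; the i flag accepts the uppercase name.
preg_match_all($pattern, "Meet God's General KATHRYN KUHLMAN.", $matches);
print_r($matches[1]);

// What preg_quote() does to a keyword containing metacharacters.
$escaped = preg_quote('a.b?', '~');
echo $escaped, PHP_EOL;
```

Without the escaping, a keyword such as a.b? would be read as pattern syntax instead of literal text.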

...now it gets a little fiddly... Now that we are dealing with isolated text nodes, regex can be used to differentiate qualifying strings from non-qualifying strings.

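preg_split() creates a flat, indexed array of non-empty substrings. Substrings that qualify for translation are isolated as their own elements, and any non-qualifying substrings are isolated as elements as well.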

The final text node in my sample generates 4 elements:

0 => '
',                                 // non-qualifying newline
1 => 'Max KANTCHEDE',              // translatable string
2 => ' & ',                        // non-qualifying text
3 => 'Kathryn Kuhlman'             // translatable string
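The split can be reproduced in isolation. The sketch below hard-codes the pattern and the final text node's value (both taken from the sample above), and contrasts the result with and without PREG_SPLIT_DELIM_CAPTURE. The variable names are illustrative.

```php
<?php
$pattern = '~\b(Kathryn Kuhlman|Max KANTCHEDE)\b~i';
$nodeValue = "\nMax KANTCHEDE & Kathryn Kuhlman";

// With DELIM_CAPTURE the matched keywords are kept as their own elements;
// NO_EMPTY drops the empty string that would otherwise follow the last match.
$withDelims = preg_split($pattern, $nodeValue, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

// Without DELIM_CAPTURE the keywords are consumed as separators and lost.
$withoutDelims = preg_split($pattern, $nodeValue, 0, PREG_SPLIT_NO_EMPTY);

var_export($withDelims);
echo PHP_EOL;
var_export($withoutDelims);
echo PHP_EOL;
```

Keeping the delimiters is what lets the loop walk the fragments in their original order and decide, piece by piece, whether to build an <a> element or a plain text node.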

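For translatable strings, a new <a> node is created, filled with the appropriate attributes and text, and then pushed into a temporary array.
For non-translatable strings, a text node is created and then pushed into the temporary array.
If any translations/replacements have been made, the DOM is updated; otherwise, no mutation of the document is necessary.
Finally, the finished HTML document is echoed. Because your sample input has some text that is not inside any tag, the temporary leading <p> and trailing </p> that DOMDocument adds for stability must be removed to restore the structure to its original form. If all of the text is enclosed in tags, you can simply use saveHTML() without any string hacking.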
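If trimming by fixed character counts feels fragile, one alternative (an assumed variation, not part of the code above) is to supply your own wrapper element and then serialize only its children. The wrapper id and the shortened sample text below are hypothetical.

```php
<?php
// Shortened, made-up input with text that is not inside any tag.
$html = "Meet God's General. <br>\nMax KANTCHEDE";

libxml_use_internal_errors(true);
$dom = new DOMDocument();

// Wrap the input in a known container so the parser has a single root element.
$dom->loadHTML(
    '<div id="wrapper">' . $html . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// ... keyword replacement on the text nodes would happen here ...

// Serialize each child of the wrapper, which leaves the wrapper itself out.
$inner = '';
foreach ($dom->documentElement->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $inner, PHP_EOL;
```

This avoids depending on exactly which tags the parser adds around loose text, at the cost of one extra loop when saving.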
