How to convert unclickable plain text URLs to links in HTML source

Question

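I want to detect URLs in HTML source and turn them into links. I searched Stack Overflow, but most answers are about detecting and converting links in plain text strings. When I apply those to HTML, the markup becomes invalid; i.e. img src attributes get rewritten, and so on.

P.S. Close voters: please read the question carefully! This is not a duplicate.

For example, line 1 needs to be converted; lines 2 and 3 do not.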

<!-- Sample html source -->
<div>
   Line 1 : https://www.google.com/
   Line 2 : <a href="https://www.google.com/">https://www.google.com/</a>
   Line 3: <img src="http://a-domain.com/lovely-image.jpg">
</div>

I need to:

Find any URL in the HTML body.

Check whether it is already clickable, i.e. whether it sits inside 'a', 'img', '!--', etc.

If it is not, make it clickable by wrapping it in 'a'.

How can I do this? Either a C# or a JS solution is fine for me.

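Latest update: Changing the project's build target from 4.7.2 to 4.5 and back to 4.7.2 fixed the "bug".

Update: Here is my solution, written with @jira's help. The problem is that the nodes do not change at all. The recursive function does its work and replaces the links (the debugger confirms it), but the HTML document is never updated. No modification made inside the function is visible outside it, and I don't know why: InnerText changes, but InnerHtml does not.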

var htmlVersion = "<html><head></head><body>\r\n"
   + "Some text\r\n"
   + "<div>http://google.com</div>\r\n"
   + " Then later more text: http://500px.com\r\n"
   + "<div>Sub <span>abc</span> Back text</div>\r\n"
   + "And the final text"
   + "</body></html>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlVersion);

// Linkify body
var modified = false;
var bodyNode = doc.DocumentNode.SelectSingleNode("//body"); 
var before = bodyNode.InnerHtml;
bodyNode = Linkify(bodyNode);
modified = modified || bodyNode.InnerHtml != before;
// modified is false !!!

The recursive Linkify function:

HtmlAgilityPack.HtmlNode Linkify(HtmlAgilityPack.HtmlNode node)
{
    if (node.Name == "a") // It's already a link
    {
        return node;
    }

    if (node.Name == "#text") // Do replacement here
    {

        // Create links: $1 is the URL captured by the outer group
        node.InnerHtml = Regex.Replace(node.InnerHtml,
            @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)",
            "<a target='_blank' href='$1'>$1</a>");

    }

    for (int i = 0; i < node.ChildNodes.Count; i++) // Go for child nodes
    {
        node.ChildNodes[i] = Linkify(node.ChildNodes[i]);
    }
    return node;
}

Answer 1

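Use an HTML parser such as HtmlAgilityPack. Select only the text nodes, then search for links inside them. That way you never touch attribute values such as an img src, and by skipping text that is already inside an 'a' element you leave existing links alone. Depending on how precise you need to be, a regular expression is enough for the URL matching.

For example: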

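using System.Text.RegularExpressions;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html);
var r = new Regex(@"(https?://[^\s]+)");
// Text nodes only, skipping text that is already inside an <a> element
var textNodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");

// SelectNodes returns null when nothing matches
if (textNodes != null) {
    foreach (var textNode in textNodes) {
        var text = textNode.GetDirectInnerText();
        // $1 is the URL captured by the regex
        textNode.InnerHtml = r.Replace(text, "<a href=\"$1\">$1</a>");
    }
}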

A regex that validates links correctly can get quite complex. See the other answers on SO for options.
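One reason the pattern tends to grow: a greedy class like `[^\s]+` also swallows punctuation that merely follows the URL in a sentence. Below is a rough JavaScript sketch (the question accepts C# or JS) of one way to trim that. The names `CANDIDATE_RE`, `trimTrailing`, and `findUrls` are made up for this example, and the trimming rules are a heuristic, not a full URL grammar.

```javascript
// Candidate URL: a scheme, then everything up to whitespace or an
// HTML-significant character. Deliberately loose; cleaned up afterwards.
const CANDIDATE_RE = /\b(?:https?|ftp):\/\/[^\s<>"']+/g;

// Strip characters that are usually sentence punctuation, not URL content.
function trimTrailing(url) {
  let previous;
  do {
    previous = url;
    // Trailing . , ; : ! ? almost always belong to the surrounding sentence.
    url = url.replace(/[.,;:!?]+$/, '');
    // Drop a closing parenthesis only if the URL has no opening one.
    if (url.endsWith(')') && !url.includes('(')) {
      url = url.slice(0, -1);
    }
  } while (url !== previous);
  return url;
}

// Returns the cleaned URLs found in a plain-text string.
function findUrls(text) {
  return Array.from(text.matchAll(CANDIDATE_RE), (m) => trimTrailing(m[0]));
}

console.log(findUrls('See http://example.com/path. Or (https://example.org)'));
```

The same trimming could be applied inside a replace callback before building the anchor, so that the trailing punctuation stays outside the link.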

Answer 2

Changing the project's build target from 4.7.2 to 4.5 and back to 4.7.2 fixed the "bug".

Here is the working code:

var htmlVersion = "<html><head></head><body>\r\n"
   + "Some text\r\n"
   + "<div>http://google.com</div>\r\n"
   + " Then later more text: http://500px.com\r\n"
   + "<div>Sub <span>abc</span> Back text</div>\r\n"
   + "And the final text"
   + "</body></html>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlVersion);

// Linkify body
var modified = false;
var bodyNode = doc.DocumentNode.SelectSingleNode("//body"); 
var before = bodyNode.InnerHtml;
bodyNode = Linkify(bodyNode);
modified = modified || bodyNode.InnerHtml != before;

The recursive Linkify function:

HtmlAgilityPack.HtmlNode Linkify(HtmlAgilityPack.HtmlNode node)
{
    if (node == null || node.Name == "a") // It's already a link
    {
        return node;
    }

    if (node.Name == "#text") // Do replacement here
    {

        // Create links: $1 is the URL captured by the outer group
        node.InnerHtml = Regex.Replace(node.InnerHtml,
            @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?)",
            "<a target='_blank' href='$1'>$1</a>");

    }

    for (int i = 0; i < node.ChildNodes.Count; i++) // Go for child nodes
    {
        node.ChildNodes[i] = Linkify(node.ChildNodes[i]);
    }
    return node;
}
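
Since the question also accepts JavaScript, here is a rough, dependency-free sketch of the same skip-existing-links idea. It is not a real HTML parser: it assumes reasonably well-formed markup, splits it into tags and text with a regular expression, and will misread attribute values that contain `>`. The names `URL_RE`, `TOKEN_RE`, and `linkifyHtml` are hypothetical. In a browser, walking the text nodes of the parsed DOM (for instance with `document.createTreeWalker`) would likely be more robust.

```javascript
// URLs up to the next whitespace, quote, or angle bracket.
const URL_RE = /\b(?:https?|ftp):\/\/[^\s<>"']+/g;

// Tokens, in priority order: comments, whole script/style elements,
// any other tag, and runs of text between tags.
const TOKEN_RE = /<!--[\s\S]*?-->|<(script|style)\b[\s\S]*?<\/\1\s*>|<[^>]*>|[^<]+/gi;

function linkifyHtml(html) {
  let anchorDepth = 0; // greater than 0 while inside an <a> element
  return html.replace(TOKEN_RE, (token) => {
    if (token.startsWith('<')) {
      if (/^<a[\s>]/i.test(token)) {
        anchorDepth++;
      } else if (/^<\/a\s*>/i.test(token)) {
        anchorDepth = Math.max(0, anchorDepth - 1);
      }
      // Tags, comments, and script/style blocks are copied unchanged,
      // so attribute values such as img src are never touched.
      return token;
    }
    if (anchorDepth > 0) {
      return token; // text that is already inside a link
    }
    return token.replace(URL_RE, (url) => `<a href="${url}">${url}</a>`);
  });
}

const sample = [
  '<div>',
  '   Line 1 : https://www.google.com/',
  '   Line 2 : <a href="https://www.google.com/">https://www.google.com/</a>',
  '   Line 3: <img src="http://a-domain.com/lovely-image.jpg">',
  '</div>',
].join('\n');

console.log(linkifyHtml(sample));
```

With the sample from the question, only the URL on line 1 is wrapped; lines 2 and 3 come back unchanged.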

javascript

c#

html-agility-pack