Regex to scrape a link from onclick="javascript:newwindow1(...)"

Question

I need to scrape an https link from two kinds of HTML.

The first looks like this:

          <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com/uploads/order/8c25ce592gfgfgfh99.pdf');">
this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

The second looks like this:

 <a href="javascript:void(0)" onclick="javascript:newwindow1('https://hello.com//webadmin/pdf/order/2018/Aug/hello this is regarding  an older document Ors._2018-08-31 12:09:12.pdf');">
    this is some content  Lorem Ipsum Lorem Ipsum Lorem Ipsum &nbsp; <img src="/img/pdf.jpg" width="15"></a>

The difference between the two is the link inside newwindow1: in the second HTML the link contains a few spaces, and the string pdf appears in it twice.

Now I want to extract the link from both of them. I am using C#:

Regex.Match(HtmlString, @"('https[^\s]+.pdf')");

This way I can extract the link from the first HTML, but from the second HTML it extracts this:

https://hello.com//webadmin/pdf/

It starts at https and stops at the first pdf, but the link is not complete at that point.

If this can be done with HTML Agility Pack instead of regex, please let me know.

Answer 1

With HtmlAgilityPack you can parse the HTML DOM document, but you cannot parse JavaScript code.

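You can only use a regex if you know the code is always formatted the way it is shown in the question, i.e. if the value you need to extract is always inside single quotes. In that case, use the negated character class [^'], which matches any character except a single quote, instead of [^\s], which matches any character except whitespace.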

var url = Regex.Match(HtmlString, @"'https[^']+\.pdf'");

Or, to get only the URL without the single quotes:

var url = Regex.Match(HtmlString, @"'(https[^']+\.pdf)'")?.Groups[1].Value;
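
To show the capture-group pattern working on both snippets from the question, here is a minimal sketch. The class name PdfLinkExtractor and the method name ExtractPdfUrl are made up for illustration, and returning an empty string when nothing matches is an assumption about how you want to handle misses.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class PdfLinkExtractor
{
    // A single-quoted https URL ending in ".pdf"; group 1 is the URL without the quotes.
    private static readonly Regex PdfUrl = new Regex(@"'(https[^']+\.pdf)'");

    // Returns the first quoted .pdf URL found in the markup,
    // or an empty string when there is no match (an assumption for this sketch).
    public static string ExtractPdfUrl(string html)
    {
        Match match = PdfUrl.Match(html);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }
}
```

Calling PdfLinkExtractor.ExtractPdfUrl on the second snippet should return the whole URL, spaces included, because [^']+ only stops at the closing single quote.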

Note that you should escape a dot that appears outside a character class in the pattern so that it matches a literal dot.
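
To see why the escape matters, the sketch below runs both variants against an input whose quoted value ends in "xpdf" rather than ".pdf". The URL and the class name DotEscapeDemo are hypothetical, chosen only to illustrate the point.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class DotEscapeDemo
{
    // Made-up input: the quoted value ends in "xpdf", so it is not a .pdf link.
    public const string NotAPdf = "newwindow1('https://hello.com/files/reportxpdf');";

    // Unescaped dot: "." matches any character, so the "x" before "pdf" is accepted.
    public static bool UnescapedMatches() =>
        Regex.IsMatch(NotAPdf, @"'https[^']+.pdf'");

    // Escaped dot: "\." requires a literal period, so this input is rejected.
    public static bool EscapedMatches() =>
        Regex.IsMatch(NotAPdf, @"'https[^']+\.pdf'");
}
```

Here UnescapedMatches() returns true and EscapedMatches() returns false, so only the escaped pattern rejects the non-PDF value.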
