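Retrieve relative URLs from a text
I have an HTML string that contains both absolute and relative URLs, and I am trying to retrieve only the relative URLs. I tried the get-urls package, but it only retrieves absolute URLs.
Example of the HTML string received: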
<!DOCTYPE>
<html>
<head>
<title>Our first HTML page</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<h2>Welcome to the web site: this is a heading inside of the heading tags.</h2>
<p>This is a paragraph of text inside the paragraph HTML tags. We can just keep writing ...
</p>
<h3>Now we have an image:</h3>
<div><img src="/images/plantTracing.gif" alt="Graphic of a Mouse Pad"></div>
<h3>
This is another heading inside of another set of headings tags; this time the tag is an 'h3' instead of an 'h2' , that means it is a less important heading.
</h3>
<h4>Yet another heading - right after this we have an HTML list:</h4>
<ol>
<li><a href="https://github.com/">First item in the list</a></li>
<li><a href="/modules/example.md"> Second item in the list</a></li>
<li>Third item in the list</li>
</ol>
<p>You will notice in the above HTML list, the HTML automatically creates the numbers in the list.</p>
<h3>About the list tags</h3>
</body>
</html>
This is what I am currently doing:
getUrls(htmlString) // htmlString is the HTML string received
It only returns {https://github.com/}
I want it to return {https://github.com/, /modules/example.md}
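The get-urls package requires a URL to start with a scheme such as http:// or with a known top-level domain.
In fact, its documentation even states this: "Requires URLs to have a scheme or leading www. to be considered a URL."
Since the relative paths you are looking for have neither, that package won't do what you want.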
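To make the scheme distinction concrete, here is a minimal sketch that uses Node's built-in WHATWG `URL` class rather than get-urls. The helper name `isRelativeUrl` and the candidate list are assumptions made for illustration. The constructor throws when a string has no scheme and no base URL is supplied, which is one simple way to tell the two kinds apart.

```javascript
// Hypothetical helper, not part of get-urls.
// new URL(value) throws a TypeError when `value` has no scheme and no base
// URL is given, so a failed parse indicates a relative reference.
function isRelativeUrl(value) {
  try {
    new URL(value); // succeeds only for absolute URLs
    return false;
  } catch {
    return true;
  }
}

// URL strings taken from the HTML sample in the question.
const candidates = [
  "https://github.com/",
  "/modules/example.md",
  "/images/plantTracing.gif",
];

const relative = candidates.filter(isRelativeUrl);
console.log(relative);
```

Note that this check also classifies protocol-relative strings such as `//example.com/x` and empty strings as relative, so it may need tightening depending on the input.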
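You would probably benefit most from an actual HTML parser such as cheerio, which finds URL-bearing HTML attributes based on the HTML context rather than relying on text-matching tricks, so it will find all the relative-path URLs.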