Find a URL in a document using regex in Ruby

Question

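I have been trying to find a URL in an HTML document. This has to be done with a regex, because the URL is not inside any HTML tag, so I can't use nokogiri for it. To fetch the HTML I used httparty, like this: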

require 'httparty'
doc = HTTParty.get("http://127.0.0.1:4040")
puts doc

This prints the HTML. To get the URL, I use .split() to cut my way down to it. The full code is:

require 'httparty'

doc = HTTParty.get('http://127.0.0.1:4040').split(".ngrok.io")[0].split('https:')[2]

puts "https:#{doc}.ngrok.io"

I want to do this with a regex instead, because ngrok might update its localhost HTML page, and then this code would stop working. How can I do that?

Answer 1

If I understand correctly, you want to find all hostnames matching "https://(any subdomain).ngrok.io", right?

If so, you can use String#scan with a regex. Here is an example:

# get your body (replace with your HTTP request)
body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"
puts body

# Use scan and you're done
urls = body.scan(%r{https://[0-9A-Za-z-\.]+\.ngrok\.io})
puts urls

It produces an array containing ["https://subdomain.ngrok.io", "https://subdomain-1.subdomain.ngrok.io"].

If you want to remove duplicates, call .uniq on the result.

This does not handle every edge case (for example, it only matches https URLs), but it is probably enough for what you need.
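To tie the pieces together, here is a minimal sketch that wraps the scan and .uniq steps in one method. The names NGROK_URL and extract_ngrok_urls are hypothetical, made up for this illustration, and the sample HTML string is invented; it only assumes the URL appears as plain text somewhere in the page.

```ruby
# Hypothetical helper, not part of HTTParty or ngrok.
# Same pattern as above, with the hyphen moved to the end of the
# character class so it needs no escaping.
NGROK_URL = %r{https://[0-9A-Za-z.-]+\.ngrok\.io}

# Returns every distinct https ngrok URL found in the given text.
def extract_ngrok_urls(body)
  body.to_s.scan(NGROK_URL).uniq
end

# Invented sample input standing in for the fetched page.
html = 'a https://abc123.ngrok.io b https://abc123.ngrok.io c http://abc123.ngrok.io'
urls = extract_ngrok_urls(html)
puts urls
```

With the real page, you would presumably pass the response body instead of the sample string, e.g. extract_ngrok_urls(HTTParty.get('http://127.0.0.1:4040').body), and take .first if you expect a single tunnel.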


ruby

httparty