How to check for the word logo in URLs?

Question

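I want to select logo image URLs from an HTML string. I assume that a logo image URL will contain the text 'logo' somewhere in the URL.

I need a regular expression that selects image URLs from a given HTML string. A logo URL will contain the text 'logo' in its path.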

/(https?:\/\/(?:www\.)?[\w+-_.0-9@\/]+logo.(?:png|jpg|jpeg))/i

["https://static.infragistics.com/marketing/Website/home/espn-logo.png", "https://static.infragistics.com/marketing/Website/home/mondelez-logo.png", "https://static.infragistics.com/marketing/Website/home/nielsen-logo.png", "https://static.infragistics.com/marketing/Website/home/united-logo.png", "https://static.infragistics.com/marketing/Website/home/merrill-lynch-logo.png", "https://static.infragistics.com/marketing/Website/home/dell-logo.png", "https://static.infragistics.com/marketing/Website/home/intel-logo.png", "https://static.infragistics.com/marketing/Website/home/prudential-logo.png", "https://static.infragistics.com/marketing/Website/home/mcdonalds-logo.png"]

The word logo can appear anywhere in the URL.

I need a regular expression that selects the image URLs containing the text 'logo'.

Answer 1

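It may be a good idea to reduce the constraints in the expression, and perhaps even to drop the word boundaries around logo, using something similar to: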

(?i)^(?=.*logo)(?:https?)?:\/\/\S+(?:png|jpe?g|gif|tiff)$

where

(?=.*logo)

simply checks whether logo appears anywhere in the URL.
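As a rough illustration of what the lookahead contributes, the sketch below runs a pattern of the same shape against a few URLs in Ruby. It differs slightly from the expression above: it uses `\A` and `\z` because each string holds a single URL, and `https?` so that plain http URLs are accepted. The constant name `LOGO_URL` and the `example.com` URLs are made up for this example.

```ruby
# Lookahead-based check: the URL must contain "logo" somewhere and end in
# one of the listed image extensions. LOGO_URL is a name chosen for this sketch.
LOGO_URL = /\A(?=.*logo)(?:https?)?:\/\/\S+(?:png|jpe?g|gif|tiff)\z/i

samples = [
  'https://static.infragistics.com/marketing/Website/home/espn-logo.png',
  'https://example.com/logos/header.jpg',   # "logo" only in a directory name
  'https://example.com/images/banner.png'   # no "logo" anywhere
]

# Print each URL together with whether the pattern accepts it.
samples.each do |url|
  puts "#{url} => #{LOGO_URL.match?(url)}"
end
```

Note that the second URL is accepted as well, since the lookahead does not care where in the URL the word appears.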

If we only want to check for the word logo in the image name, as in

espn-logo.png
espn-logos.png

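we would move the positive lookahead to just after the last slash and restrict it to non-slash characters, so that it cannot be satisfied by an earlier path segment, for example:

(?i)^(?:https?)?:\/\/\S+\/(?=[^\/\s]*logo)[^\/\s]*(?:png|jpe?g|gif|tiff)$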
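A small Ruby sketch of the file-name-only check follows. It is a variant under stated assumptions rather than the exact expression above: the lookahead uses `[^\/\s]*` so that backtracking cannot satisfy it from an earlier path segment, `\A` and `\z` anchor a single string, and the constant name `FILE_NAME_LOGO` and the `example.com` URL are invented for the illustration.

```ruby
# Accept a URL only if the part after the last slash contains "logo".
# FILE_NAME_LOGO is a hypothetical name used for this sketch.
FILE_NAME_LOGO =
  /\A(?:https?)?:\/\/\S+\/(?=[^\/\s]*logo)[^\/\s]*(?:png|jpe?g|gif|tiff)\z/i

in_file_name = 'https://static.infragistics.com/marketing/Website/home/espn-logos.png'
in_directory = 'https://example.com/logos/header.jpg'

puts FILE_NAME_LOGO.match?(in_file_name) # "logo" is part of the file name
puts FILE_NAME_LOGO.match?(in_directory) # "logo" appears only in a directory
```

With `.*` in place of `[^\/\s]*`, the second URL would also be accepted, because `\S+\/` can backtrack to an earlier slash.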

Our desired image extensions would go into this non-capturing group:

(?:png|jpe?g|gif|tiff|svg)
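If the list of extensions is likely to change, one option is to build that group from an array. The sketch below is only one way to do it; the names `EXTENSIONS`, `EXT_GROUP` and `LOGO_IMAGE` are hypothetical, and the literal `\.` before the extension is an extra constraint that the expressions above do not have.

```ruby
# Extensions to accept; edit this list instead of the regex itself.
EXTENSIONS = %w[png jpg jpeg gif tiff svg].freeze

# Regexp.union escapes each string and joins them with "|".
EXT_GROUP = Regexp.union(EXTENSIONS)

# Interpolate the plain source so the outer /i flag still applies to it.
LOGO_IMAGE = /\A(?=.*logo)https?:\/\/\S+\.(?:#{EXT_GROUP.source})\z/i

puts LOGO_IMAGE.match?('https://example.com/img/company-logo.SVG')
puts LOGO_IMAGE.match?('https://example.com/img/company-logo.bmp')
```

Interpolating `EXT_GROUP` directly instead of its `source` would embed its own flags and switch case-insensitivity off inside the group.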

Test

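# In Ruby, ^ and $ are line anchors, so scan tests every line of str separately.
# The trailing flag is /i (case-insensitive); /s would set a Shift_JIS encoding in Ruby.
re = /^(?=.*logo)(?:https?)?:\/\/\S+(?:png|jpe?g|gif|tiff)$/i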
str = 'https://static.infragistics.com/marketing/Website/home/espn-logo.png
https://static.infragistics.com/marketing/Website/home/mondelez-logo.gif
https://static.infragistics.com/marketing/Website/home/nielsen-logo.jpg
https://static.infragistics.com/marketing/Website/home/united-logo.jpeg
https://static.infragistics.com/marketing/Website/home/merrill-lynch-logo.PNG
https://static.infragistics.com/marketing/Website/home/dell-logo.TIFF
https://static.infragistics.com/marketing/Website/home/intel-logo.gif
https://static.infragistics.com/marketing/Website/home/prudential-logo.png
https://static.infragistics.com/marketing/Website/home/mcdonalds-logo.GIF
https://static.infragistics.com/marketing/Website/home/mcdonalds-alogo.GIF
https://static.infragistics.com/marketing/Website/home/mcdonalds-logos.GIF'

# Print every line that contains "logo" and ends in an image extension.
str.scan(re) do |match|
  puts match
end

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it. There you can also watch how it matches against some sample inputs.

RegEx Circuit

jex.im visualizes regular expressions.

Edit

For cases where the input contains other URLs as well, we would usually add more constraints to handle the edge cases, such as:

(?i)(?<=")\s*(?:https?)?:\/\/[^"]+\/(?=[^"]*logo)[^"]*(?:png|jpe?g|gif|tiff)\s*(?=")
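To show how this expression might be applied to an HTML string, here is a minimal Ruby sketch. The sample markup, the `example.com` URLs and the constant names `SAMPLE_HTML` and `QUOTED_LOGO_URL` are assumptions made for the illustration; only the first URL comes from the question.

```ruby
# A tiny, made-up HTML fragment with double-quoted attribute values.
SAMPLE_HTML = <<~HTML_DOC
  <img src="https://static.infragistics.com/marketing/Website/home/espn-logo.png">
  <img src="https://example.com/images/banner.png">
  <a href="https://example.com/logo/about.html">About</a>
HTML_DOC

# The expression from the answer, with the inline (?i) written as the /i flag.
QUOTED_LOGO_URL =
  /(?<=")\s*(?:https?)?:\/\/[^"]+\/(?=[^"]*logo)[^"]*(?:png|jpe?g|gif|tiff)\s*(?=")/i

# scan returns the matched substrings; strip removes any padding inside the quotes.
logo_urls = SAMPLE_HTML.scan(QUOTED_LOGO_URL).map(&:strip)
puts logo_urls
```

The lookbehind and lookahead on the double quote mean that URLs in single-quoted or unquoted attributes are not found by this sketch.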


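I think it may be easier to collect the image URLs first and then check whether the image name contains \blogo\b. Otherwise, the expression can become complicated.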
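That two-step idea could look roughly like the following in Ruby. This is a sketch under assumptions, not a definitive implementation: the method name `logo_image_urls`, the constants `IMAGE_URL` and `LOGO_WORD`, and the `example.com` URL are all hypothetical, and the image pattern is deliberately loose.

```ruby
# Step 1 pattern: anything that looks like an http(s) URL ending in an image extension.
IMAGE_URL = /https?:\/\/[^\s"'<>]+\.(?:png|jpe?g|gif|tiff)\b/i

# Step 2 pattern: the standalone word "logo".
LOGO_WORD = /\blogo\b/i

# Collect all image URLs, then keep those whose file name contains the word.
def logo_image_urls(html)
  html.scan(IMAGE_URL).select do |url|
    file_name = url.split('/').last
    file_name.match?(LOGO_WORD)
  end
end

html = '<img src="https://static.infragistics.com/marketing/Website/home/espn-logo.png">' \
       '<img src="https://static.infragistics.com/marketing/Website/home/mcdonalds-alogo.GIF">' \
       '<img src="https://example.com/logo/banner.jpg">'

puts logo_image_urls(html)
```

Because `\b` treats the underscore as a word character, a name such as `company_logo.png` would be rejected by `LOGO_WORD`; dropping the boundaries, as suggested earlier, would accept it along with names like `alogo` and `logos`.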

regex

ruby-on-rails

image

capture-group
