Regex for finding http and https URLs in a string

Question

I have a string that contains several URLs starting with http or https. I need to extract all of these URLs and put them in a list.

I tried the code below.

List<String> httpLinksList = new ArrayList<>();

String hyperlinkRegex = "((http:\\/\\/|https:\\/\\/)?(([a-zA-Z0-9-]){2,}\\.){1,4}([a-zA-Z]){2,6}(\\/([a-zA-Z-_\\/\\.0-9#:?=&;,]*)?)?)";

String synopsis = "This is http://whosebug.com/questions and https://test.com/method?param=wasd The code below catches all urls in text and returns urls in list";

    Pattern pattern = Pattern.compile(hyperlinkRegex);
    Matcher matcher = pattern.matcher(synopsis);
    while(matcher.find()){
        System.out.println(matcher.find()+"  "+matcher.group(1)+"  "+matcher.groupCount()+"  "+matcher.group(2));

        httpLinksList.add(matcher.group());
    }

    System.out.println(httpLinksList);

I need the following result: [http://whosebug.com/questions, https://test.com/method?param=wasd], but I get the output below instead: [https://test.com/method?param=wasd]

Answer 1

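I know this is not exactly what you asked for, since you are specifically looking for a regex, but I thought it would be interesting to try an indexOf variant. I will leave it here as an alternative to whatever regex someone comes up with: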

public static void main(String[] args){
   String synopsis = "This is http://whosebug.com/questions and https://test.com/method?param=wasd The code below catches all urls in text and returns urls in list";

    ArrayList<String> list = splitUrl(synopsis);
    for (String s : list) {
        System.out.println(s);
    }
}

public static ArrayList<String> splitUrl(String s) 
{
    ArrayList<String> list = new ArrayList<>();
    int spaceIndex = 0;
    while (true) {
        int httpIndex = s.indexOf("http", spaceIndex);
        if (httpIndex < 0) {
            break;
        }

        spaceIndex = s.indexOf(" ", httpIndex);
        if (spaceIndex < 0) {
            list.add(s.substring(httpIndex));
            break;
        }
        else {
            list.add(s.substring(httpIndex, spaceIndex));
        }
    }
    return list;
}

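All of the logic is contained in the splitUrl(String s) method, which takes a String as its parameter and returns an ArrayList<String> of all the URLs it found.

It first searches for the index of the next http, then for the first space that appears after the URL, and takes the substring between the two. It then passes the index of that space as the second argument to indexOf(String, int), so the next search starts after the URL that was already found and does not return it again.

A special case is needed when the URL is the last part of the String, because there is no space after it. This is handled when indexOf for the space returns a negative value: I use substring(int) instead of substring(int, int), which takes everything from the current position to the end of the string.

The loop ends when indexOf for http returns a negative value. If indexOf for the space returns a negative value, the method performs that final substring and then breaks.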

Output:

http://whosebug.com/questions

https://test.com/method?param=wasd

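Note: As someone mentioned in the comments, this implementation also works with non-Latin characters such as hiragana, which may be an advantage over the regex.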
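To illustrate the note above, here is a small sketch that runs the same indexOf approach on a URL whose path contains hiragana. The class name SplitUrlDemo and the example.com / example.org URLs are made up for the demonstration, and splitUrl is repeated only so the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitUrlDemo {
    // Same indexOf-based logic as splitUrl above, repeated to keep the example self-contained.
    public static List<String> splitUrl(String s) {
        List<String> list = new ArrayList<>();
        int spaceIndex = 0;
        while (true) {
            int httpIndex = s.indexOf("http", spaceIndex);
            if (httpIndex < 0) {
                break; // no more occurrences of "http"
            }
            spaceIndex = s.indexOf(" ", httpIndex);
            if (spaceIndex < 0) {
                list.add(s.substring(httpIndex)); // URL runs to the end of the string
                break;
            }
            list.add(s.substring(httpIndex, spaceIndex));
        }
        return list;
    }

    public static void main(String[] args) {
        // Hypothetical input: the first path is the hiragana word "konnichiwa",
        // written with Unicode escapes so the source file stays ASCII.
        String text = "See https://example.com/\u3053\u3093\u306b\u3061\u306f and http://example.org/a?b=c for details";
        for (String url : splitUrl(text)) {
            System.out.println(url);
        }
    }
}
```

Because the method only looks for the text http and the next space, it keeps whatever characters lie between them. The same property means that a plain word beginning with http (for example httpd) or punctuation directly after a URL would also end up in the result.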

Answer 2

This regex matches URLs with the http, https, ftp, gopher, telnet, and file schemes:

String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class xmlValue {
    public static void main(String[] args) {
        String text = "This is http://whosebug.com/questions and https://test.com/method?param=wasd The code below catches all urls in text and returns urls in list";
        System.out.println(extractUrls(text));
    }

    public static List<String> extractUrls(String text)
    {
        List<String> containedUrls = new ArrayList<String>();
        String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
        Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
        Matcher urlMatcher = pattern.matcher(text);

        while (urlMatcher.find())
        {
            containedUrls.add(text.substring(urlMatcher.start(0),
                    urlMatcher.end(0)));
        }

        return containedUrls;
    }
}

Output:

[http://whosebug.com/questions, https://test.com/method?param=wasd]

Thanks to @BullyWiiPlaza
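Since the question only asks for http and https, the pattern could also be narrowed. The following is a minimal sketch, not part of the answer above: it assumes a URL is http:// or https:// followed by everything up to the next whitespace character, and the names HttpUrlExtractor, extractHttpUrls, and HTTP_URL are chosen just for this example. Note that the loop calls find() exactly once per iteration. The loop in the question calls find() a second time inside println, which moves the matcher on to the next match, and that would explain why only the second URL ended up in the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttpUrlExtractor {
    // Assumption: a URL is "http://" or "https://" followed by non-whitespace characters.
    private static final Pattern HTTP_URL =
            Pattern.compile("https?://\\S+", Pattern.CASE_INSENSITIVE);

    public static List<String> extractHttpUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = HTTP_URL.matcher(text);
        // Each find() call advances to the next match, so call it only in the loop condition.
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "This is http://whosebug.com/questions and https://test.com/method?param=wasd "
                + "The code below catches all urls in text and returns urls in list";
        // Prints [http://whosebug.com/questions, https://test.com/method?param=wasd]
        System.out.println(extractHttpUrls(text));
    }
}
```

This pattern is deliberately loose: it does not validate the host name, and punctuation that directly follows a URL in running text (such as a trailing period) is included in the match.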

java

regex

pattern-matching

matcher