Java Regex to extract an id string, based on recurring sub-string of each id

Question

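I am reading a log file and extracting certain data contained in it. I am able to extract the time from each line of the log file.

Now I want to extract the id "ieatrcxb4498-1". All ids start with the substring ieatrcxb, which I tried to search for so that I could return the full string based on it.

I have tried many different suggestions from other posts, but without success. The patterns I tried include: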

(?i)\b("ieatrcxb"(?:.+?)?)\b
(?i)\b\w*"ieatrcxb"\w*\b"
^.*ieatrcxb.*$

I also tried to extract the full id based on a string that starts with i and ends with 1, as they all do.

Log file line

150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node "ieatrcxb4498-1"

Code

Scanner s = new Scanner(new FileReader(new File("lock-unlock.txt")));
    //Record currentRecord = null;
    ArrayList<Record> list = new ArrayList<>();

    while (s.hasNextLine()) {
        String line = s.nextLine();

        Record newRec = new Record();
        // newRec.time =
        newRec.time = regexChecker("([0-1]?\\d|2[0-3]):([0-5]?\\d):([0-5]?\\d)", line);

        newRec.ID = regexChecker("^.*ieatrcxb.*$", line);

        list.add(newRec);

    }


public static String regexChecker(String regEx, String str2Check) {

    Pattern checkRegex = Pattern.compile(regEx);
    Matcher regexMatcher = checkRegex.matcher(str2Check);
    String regMat = "";
    while(regexMatcher.find()){
        if (regexMatcher.group().length() != 0) {
            regMat = regexMatcher.group();
        }
        //System.out.println("Inside the "+ regexMatcher.group().trim());
    }

     return regMat;
}

I need a simple pattern that will do this for me.

Answer 1

public static void main(String[] args) {
    String line = "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"";
    String regex ="ieatrcxb.*1";
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(line);
    while(m.find()){
        System.out.println(m.group());
    }
}

Or, if the IDs are always quoted (note that this keeps the surrounding quotes in the result):

 String id = line.substring(line.indexOf("\""), line.lastIndexOf("\"")+1);
 System.out.println(id);
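
A caveat on the regex above: `.*` is greedy, so "ieatrcxb.*1" runs to the last 1 on the line, which matters if anything containing a 1 follows the id. Below is a minimal sketch of a stricter variant, assuming every id is wrapped in double quotes as in the sample log line. The class name QuotedIdExtractor and the method extractId are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedIdExtractor {
    // Assumption: the id is always enclosed in double quotes.
    // Group 1 captures everything from "ieatrcxb" up to the closing quote.
    private static final Pattern QUOTED_ID = Pattern.compile("\"(ieatrcxb[^\"]*)\"");

    // Returns the id without the quotes, or "" when the line has no id
    // (the same "empty string on no match" convention as regexChecker).
    static String extractId(String line) {
        Matcher m = QUOTED_ID.matcher(line);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String line = "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"";
        System.out.println(extractId(line)); // prints ieatrcxb4498-1
    }
}
```

Because the match stops at the closing quote, trailing text on the line does not end up in the result.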

Answer 2

You are trying to do this the hard way. If every line of the lock-unlock.txt file has the same layout as the snippet you posted, you can do the following:

File logFile = new File("lock-unlock.txt");

List<String> lines = Files.readAllLines(logFile.toPath());

List<Integer> ids = lines.stream()
                .filter(line -> line.contains("ieatrcxb"))
                .map(line -> line.split( "\"")[1]) //"ieatrcxb4498-1"
                .map(line -> line.replaceAll("\\D+","")) //"44981"
                .map(Integer::parseInt) // 44981
                .collect( Collectors.toList() );

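If you are not just looking for the ID number, simply remove or comment out the second and third .map() calls; the result will then be a list of Strings instead of Integers.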
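
The string-only variant described above could look roughly like this. It is a sketch under two assumptions: the id is the first double-quoted token on the line, and the lines are passed in as a list instead of being read from the file so the method is easy to try out. The names IdListDemo and extractIds are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class IdListDemo {
    // Keeps only lines that contain a quoted id and returns the ids as Strings.
    // Assumption: the id is the first quoted token on each matching line.
    static List<String> extractIds(List<String> lines) {
        return lines.stream()
                .filter(line -> line.contains("\"ieatrcxb")) // the opening quote guarantees split()[1] exists
                .map(line -> line.split("\"")[1])            // text between the first pair of quotes
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "150: 2017-06-14 18:02:21 INFO  monitorinfo           :     Info: Lock VCS on node \"ieatrcxb4498-1\"",
                "151: 2017-06-14 18:02:22 INFO  monitorinfo           :     Info: nothing to lock");
        System.out.println(extractIds(lines)); // prints [ieatrcxb4498-1]
    }
}
```

With the real file, the list would come from Files.readAllLines(logFile.toPath()) as in the snippet above.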

Answer 3

Is the ID always in the format "ieatrcxb followed by 4 digits, followed by -, followed by 1 digit"?

If so, you can do this:

regexChecker("ieatrcxb\\d{4}-\\d", line);

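Note the {4} quantifier, which matches exactly 4 digits (\d). If the last digit is always 1, you can also use "ieatrcxb\\d{4}-1".

If the number of digits varies, you can use "ieatrcxb\\d+-\\d+", where + means "1 or more".

You can also use the {} quantifier with a minimum and maximum number of occurrences. Example: "ieatrcxb\\d{4,6}-\\d", where {4,6} means "minimum of 4 and maximum of 6 occurrences" (this is just an example; I don't know whether it applies to your case). This is useful if you know exactly how many digits the ID can contain.

All of the above work for your case and return ieatrcxb4498-1. Which one to use depends on how your input varies.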
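
To see how the quantifiers differ, here is a small sketch that runs the three patterns against the id from the question and against a longer, made-up id (ieatrcxb123456-12 is invented purely for comparison). The helper firstMatch is a hypothetical stand-in for regexChecker that returns the first match instead of the last.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdPatternDemo {
    // Returns the first match of regex in input, or "" if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        String fromQuestion = "Lock VCS on node \"ieatrcxb4498-1\"";
        String madeUp = "Lock VCS on node \"ieatrcxb123456-12\""; // hypothetical longer id

        String[] patterns = {
            "ieatrcxb\\d{4}-\\d",   // exactly 4 digits, then one digit
            "ieatrcxb\\d+-\\d+",    // any number of digits on both sides
            "ieatrcxb\\d{4,6}-\\d"  // 4 to 6 digits, then one digit
        };
        for (String p : patterns) {
            System.out.println(p + " -> [" + firstMatch(p, fromQuestion)
                    + "] [" + firstMatch(p, madeUp) + "]");
        }
    }
}
```

On the made-up id, the {4} pattern finds nothing because six digits follow the prefix, the + pattern returns the whole id, and the {4,6} pattern stops after the first digit of the suffix.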

If you want only the numbers without the ieatrcxb part (4498-1), you can use a lookbehind regex:

regexChecker("(?<=ieatrcxb)\\d{4,6}-\\d", line);

This makes ieatrcxb not part of the match, so only 4498-1 is returned.

If you don't want the -1 either and only want 4498, you can combine it with a lookahead:

regexChecker("(?<=ieatrcxb)\\d{4,6}(?=-\\d)", line);

This returns just 4498.
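
If both pieces are needed at once, capturing groups are an alternative to lookarounds. This is a sketch of a different technique, not part of the regexChecker approach above, and it assumes the same "4 to 6 digits, dash, one digit" shape. The names IdPartsDemo and splitId are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdPartsDemo {
    // Group 1 = the digits right after the prefix, group 2 = the digit after the dash.
    private static final Pattern ID_PARTS = Pattern.compile("ieatrcxb(\\d{4,6})-(\\d)");

    // Returns {number, suffix}, or an empty array when the line has no id.
    static String[] splitId(String line) {
        Matcher m = ID_PARTS.matcher(line);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = splitId("Info: Lock VCS on node \"ieatrcxb4498-1\"");
        System.out.println(parts[0] + " / " + parts[1]); // prints 4498 / 1
    }
}
```

A single find() yields both parts here, whereas the lookaround patterns need one call per part.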

java

regex

matcher

pattern-matching