How to get substring with pattern using Java
I have a file containing records like the following:
drwxr-xr-x - root supergroup 0 2015-04-05 05:26 /user/root
drwxr-xr-x - hadoop supergroup 0 2014-11-05 11:56 /user/root/input
drwxr-xr-x - hadoop supergroup 0 2014-11-05 03:06 /user/root/input/foo
drwxr-xr-x - hadoop supergroup 0 2015-04-28 03:06 /user/root/input/foo/bar
drwxr-xr-x - hadoop supergroup 0 2013-11-06 15:54 /user/root/input/foo/bar/20120706
-rw-r--r-- 3 hadoop supergroup 0 2013-11-06 15:54 /user/root/input/foo/bar/20120706/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2013-11-06 15:54 /user/root/input/foo/bar/20120706/_logs
drwxr-xr-x - hadoop supergroup 0 2013-11-06 15:54 /user/root/input/foo/bar/20120706/_logs/history
In my Java code, I use the Pattern and Matcher classes to get the substrings that I want to process later. The code is listed below:
String filename = "D:\\temp\\files_in_hadoop_temp.txt";
Pattern thePattern
    = Pattern.compile("[a-z\\-]+\\s+(\\-|[0-9]) (root|hadoop)\\s+supergroup\\s+([0-9]+) ([0-9\\-]+) ([0-9:]+) (\\D+)\\/?.*");
try
{
    Files.lines(Paths.get(filename))
        .map(line -> thePattern.matcher(line))
        .collect(Collectors.toList())
        .forEach(theMather -> {
            if (theMather.find())
            {
                System.out.println(theMather.group(3) + "-" + theMather.group(4) + "-" + theMather.group(6));
            }
        });
} catch (IOException e)
{
    e.printStackTrace();
}
The result is as follows:
0-2015-04-05-/user/root
0-2014-11-05-/user/root/input
0-2014-11-05-/user/root/input/foo
0-2015-04-28-/user/root/input/foo/bar
0-2013-11-06-/user/root/input/foo/bar/
0-2013-11-06-/user/root/input/foo/bar/
0-2013-11-06-/user/root/input/foo/bar/
0-2013-11-06-/user/root/input/foo/bar/
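But I expected the last four lines to have no trailing "/". I tried many patterns to get rid of the trailing "/", but all of them failed.
Could you please offer some suggestions for a pattern that removes the trailing "/"?
Thanks a lot.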
What you can do is use a simple if statement to check whether the last character is a slash, and then use substring to get the new string:
if (theMather.find())
{
    String data = theMather.group(3) + "-" + theMather.group(4) + "-" + theMather.group(6);
    if (data.charAt(data.length() - 1) == '/')
        data = data.substring(0, data.length() - 1);
    System.out.println(data);
}
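As a variation on the check above, the trimming step could be pulled into a small helper. This is only a sketch: the class name TrailingSlashDemo and the method stripTrailingSlash are hypothetical names, not part of the original code, and it uses String.endsWith instead of charAt so that an empty string is handled without an exception.

```java
class TrailingSlashDemo {
    // Hypothetical helper: removes a single trailing '/' if one is present.
    static String stripTrailingSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        // One sample with a trailing slash, one without.
        System.out.println(stripTrailingSlash("0-2013-11-06-/user/root/input/foo/bar/"));
        System.out.println(stripTrailingSlash("0-2015-04-05-/user/root"));
    }
}
```

Inside the forEach from the question, the call would then be stripTrailingSlash(data) before printing.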
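Use a character class to make sure the last captured character is not a slash. So instead of
(\D+)\/?.*
try
(\D*[^\d/]).*
The part in parentheses matches the longest run of non-digit characters, with the added restriction that its last character cannot be a slash. (In a Java string literal, each backslash must be doubled.)
Note: tested.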
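To see the suggested group in context, here is a minimal, self-contained sketch that applies the question's pattern with only the last group swapped for (\D*[^\d/]). The class name ListingPatternDemo and the method extract are hypothetical, and the input lines are hard-coded from the question instead of being read from the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingPatternDemo {
    // The question's pattern; only the final group differs: (\D*[^\d/])
    static final Pattern LISTING = Pattern.compile(
        "[a-z\\-]+\\s+(\\-|[0-9]) (root|hadoop)\\s+supergroup\\s+([0-9]+) "
        + "([0-9\\-]+) ([0-9:]+) (\\D*[^\\d/]).*");

    // Returns "size-date-path" for every line the pattern matches.
    static List<String> extract(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LISTING.matcher(line);
            if (m.find()) {
                out.add(m.group(3) + "-" + m.group(4) + "-" + m.group(6));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "drwxr-xr-x - root supergroup 0 2015-04-05 05:26 /user/root",
            "drwxr-xr-x - hadoop supergroup 0 2013-11-06 15:54 /user/root/input/foo/bar/20120706",
            "-rw-r--r-- 3 hadoop supergroup 0 2013-11-06 15:54 /user/root/input/foo/bar/20120706/_SUCCESS");
        extract(lines).forEach(System.out::println);
    }
}
```

The greedy \D* first runs up to the slash before 20120706, then backtracks until [^\d/] can match, which is the "r" of "bar". The second and third sample lines therefore both yield 0-2013-11-06-/user/root/input/foo/bar, without the trailing slash.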