Java: Extracting Text Between Tags and Attributes
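I'm trying to extract the text between specific tags and attributes. For now, I'm only trying to extract the tags. I'm reading a ".gexf" file that contains XML data and saving that data as a string. I then try to extract the text between the "nodes" tags. This is my code so far: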
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static String filePath = "src/babel.gexf";

    // Reads the whole file into a single string, one "\n" per line.
    public String readFile(String filePath) throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(filePath));
        try {
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }
            return sb.toString();
        } finally {
            br.close();
        }
    }

    // Prints whatever the pattern captures between <nodes> and </nodes>.
    public void getNodesContent(String content) throws IOException {
        final Pattern pattern = Pattern.compile("<nodes>(\\w+)</nodes>", Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }

    public static void main(String[] args) throws IOException {
        Main m = new Main();
        String result = m.readFile(filePath);
        m.getNodesContent(result);
    }
}
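The code above gives me no results. When I try it with a sample string such as "My string", I do get a result. Link to the gexf file (it was too long to paste, so I had to upload it):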
https://files.fm/u/qag5ykrx
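Without a sample of the file, I can only suggest so much. What I can tell you is that you can get that text as a substring by using a tag-search loop. Here is an example: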
String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
// Stop early enough that i + 2 is always a valid index.
for (int i = 0; i + 2 < s.length(); i++) {
    // Look for the opening tag "<a>".
    if (s.charAt(i) == '<' && s.charAt(i + 1) == 'a' && s.charAt(i + 2) == '>') {
        start = i + 3; // first character after "<a>"
        // Scan from the start of the content for the closing tag "</a>".
        for (int j = start; j + 3 < s.length(); j++) {
            if (s.charAt(j) == '<' && s.charAt(j + 1) == '/' && s.charAt(j + 2) == 'a' && s.charAt(j + 3) == '>') {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}
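The code above searches the string s for the opening tag, then continues from where it found it until it reaches the closing tag. It then uses those two positions to create a substring of the string, which is the text between the two tags. You can stack as many of these tag searches as you need. Here is an example of a two-tag search: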
String s = "<a>test</a><b>list</b><a>class</a>";
int start = 0, end = 0;
// Stop early enough that i + 2 is always a valid index.
for (int i = 0; i + 2 < s.length(); i++) {
    // Look for either opening tag, "<a>" or "<b>".
    if ((s.charAt(i) == '<' && s.charAt(i + 1) == 'a' && s.charAt(i + 2) == '>') ||
            (s.charAt(i) == '<' && s.charAt(i + 1) == 'b' && s.charAt(i + 2) == '>')) {
        start = i + 3; // first character after the opening tag
        // Scan from the start of the content for either closing tag, "</a>" or "</b>".
        for (int j = start; j + 3 < s.length(); j++) {
            if ((s.charAt(j) == '<' && s.charAt(j + 1) == '/' && s.charAt(j + 2) == 'a' && s.charAt(j + 3) == '>') ||
                    (s.charAt(j) == '<' && s.charAt(j + 1) == '/' && s.charAt(j + 2) == 'b' && s.charAt(j + 3) == '>')) {
                end = j;
                System.out.println(s.substring(start, end));
                break;
            }
        }
    }
}
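The only difference is that I added clauses to the if statements to also get the text between the b tags. This system is extremely versatile, and I think you will find plenty of use for it.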
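For comparison, the same start/end search can be sketched with String.indexOf instead of character-by-character checks. This is only an illustration of the idea described above; the class name TagSearch and the helper between are made up for this example and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

class TagSearch {
    // Hypothetical helper: returns the text found between each <tag>...</tag> pair.
    static List<String> between(String s, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = s.indexOf(open, from);
            if (start < 0) {
                break;                      // no more opening tags
            }
            start += open.length();         // first character of the content
            int end = s.indexOf(close, start);
            if (end < 0) {
                break;                      // opening tag without a closing tag
            }
            found.add(s.substring(start, end));
            from = end + close.length();    // continue after the closing tag
        }
        return found;
    }

    public static void main(String[] args) {
        String s = "<a>test</a><b>list</b><a>class</a>";
        System.out.println(between(s, "a")); // [test, class]
        System.out.println(between(s, "b")); // [list]
    }
}
```

Like the loops above, this only handles tags without attributes and does not understand nesting.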
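I don't think putting the entire file content into a single string is a good idea, but I suppose that depends on how much content the file holds. If there is a lot of it, I would read that content in a different way. It would have been nice to see a fictitious example of what the file contains.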
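The comment does not say which "different way" is meant. One possibility, shown here only as a sketch, is a streaming XML parser (StAX, from the JDK's javax.xml.stream package), which visits elements one at a time instead of holding the whole file in a string. The sample XML, the "label" attribute, and the names StreamingNodes and nodeLabels are assumptions made for this example; the real babel.gexf may be structured differently.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamingNodes {
    // Collects the "label" attribute of every <node> element while streaming the input.
    static List<String> nodeLabels(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> labels = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                // next() advances to the following parse event (start tag, text, end tag, ...).
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "node".equals(reader.getLocalName())) {
                    labels.add(reader.getAttributeValue(null, "label"));
                }
            }
        } finally {
            reader.close();
        }
        return labels;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up sample data standing in for the real file.
        String xml = "<gexf><graph><nodes>"
                + "<node id=\"0\" label=\"Hello\"/>"
                + "<node id=\"1\" label=\"World\"/>"
                + "</nodes></graph></gexf>";
        System.out.println(nodeLabels(new StringReader(xml))); // [Hello, World]
    }
}
```

For a real file, a java.io.FileReader opened on the file path could be passed in place of the StringReader.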
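I suppose you could try this little method. At its core it uses a regular expression (RegEx) together with Pattern/Matcher to retrieve the desired substrings from between the tags.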
It is important to read the documentation that comes with the following method:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* This method will retrieve a string contained between string tags. You
* specify what the starting and ending tags are within the startTag and
* endTag parameters. It is you who determines what the start and end tags
* are to be which can be any strings.<br><br>
*
* @param inputString (String) Any string to process.<br>
*
* @param startTag (String) The Start Tag or String. Data content retrieved
* will be directly after this tag.<br><br>
*
* The supplied Start Tag criteria can contain a single special wildcard tag
* (~*~) providing you also place something like the closing chevron (>)
* for an HTML tag after the wildcard tag, for example:<pre>
*
* If we have a string which looks like this:
* {@code
* "<p style=\"padding-left:40px;\">Hello</p>"
* }
* (Note: to pass double quote marks in a string they must be escaped)
*
* and we want to use this method to extract the word "Hello" from between the
* two HTML tags then your Start Tag can be supplied as "<p~*~>" and of course
* your End Tag can be "</p>". The "<p~*~>" would be the same as supplying
* "<p style=\"padding-left:40px;\">". Anything between the characters <p and
* the supplied close chevron (>) is taken into consideration. This allows for
* contents extraction regardless of what HTML attributes are attached to the
* tag. The use of a wildcard tag (~*~) is also allowed in a supplied End
* Tag.</pre><br>
*
* The wildcard is used as a special tag so that strings that actually
* contain asterisks (*) can be processed as regular asterisks.<br>
*
* @param endTag (String) The End Tag or String. Data content retrieval will
* end just before this Tag is reached.<br>
*
* The supplied End Tag criteria can contain a single special wildcard tag
* (~*~) providing you also place something like the closing chevron (>)
* for an HTML tag after the wildcard tag, for example:<pre>
*
* If we have a string which looks like this:
* {@code
* "<p style=\"padding-left:40px;\">Hello</p>"
* }
* (Note: to pass double quote marks in a string they must be escaped)
*
* and we want to use this method to extract the word "Hello" from between the
* two HTML tags then your Start Tag can be supplied as "<p style=\"padding-left:40px;\">"
* and your End Tag can be "</~*~>". The "</~*~>" would be the same as supplying
* "</p>". Anything between the characters </ and the supplied close chevron (>)
* is taken into consideration. This allows for contents extraction regardless of what the
* HTML tag might be. The use of a wildcard tag (~*~) is also allowed in a supplied Start Tag.</pre><br>
*
* The wildcard is used as a special tag so that strings that actually
* contain asterisks (*) can be processed as regular asterisks.<br>
*
* @param trimFoundData (Optional - Boolean - Default is true) By default
* all retrieved data is trimmed of leading and trailing white-spaces. If
* you do not want this then supply false to this optional parameter.
*
* @return (1D String Array) If there is more than one pair of Start and End
* Tags contained within the supplied input String then each set is placed
* into the Array separately.<br>
*
* @throws IllegalArgumentException if any supplied method String argument
* is null or empty ("").
*/
public static String[] getBetweenTags(String inputString, String startTag,
                                      String endTag, boolean... trimFoundData) {
    if (inputString == null || inputString.isEmpty() || startTag == null
            || startTag.isEmpty() || endTag == null || endTag.isEmpty()) {
        throw new IllegalArgumentException("\ngetBetweenTags() Method Error! - "
                + "A supplied method argument is null or empty (\"\")!\n"
                + "Supplied Method Arguments:\n"
                + "==========================\n"
                + "inputString = \"" + inputString + "\"\n"
                + "startTag = \"" + startTag + "\"\n"
                + "endTag = \"" + endTag + "\"\n");
    }
    List<String> list = new ArrayList<>();
    // Trimming is on by default; an explicit first varargs value overrides it.
    boolean trimFound = true;
    if (trimFoundData.length > 0) {
        trimFound = trimFoundData[0];
    }
    // Quote the tags so their characters are matched literally, then turn each
    // ~*~ wildcard into a lazy ".*?" by stepping out of the \Q...\E quoting.
    String startRegex = Pattern.quote(startTag).replace("~*~", "\\E.*?\\Q");
    String endRegex = Pattern.quote(endTag).replace("~*~", "\\E.*?\\Q");
    // (?i) ignores case, (?u) makes that Unicode-aware, (?s) lets '.' match line breaks.
    Pattern pattern = Pattern.compile("(?ius)" + startRegex + "(.*?)" + endRegex);
    Matcher matcher = pattern.matcher(inputString);
    while (matcher.find()) {
        String match = matcher.group(1);
        if (trimFound) {
            match = match.trim();
        }
        list.add(match);
    }
    return list.toArray(new String[0]);
}
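As a smaller illustration of the same Pattern/Matcher idea applied to the "nodes" tag from the question, here is a sketch of why a `\w+` group finds nothing in multi-line XML while a lazy group with DOTALL does. The sample XML and the names NodesRegexDemo and firstNodesBlock are invented for this example, since the actual file content is not shown in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NodesRegexDemo {
    // Returns the trimmed content of the first <nodes ...>...</nodes> block, or null if none.
    static String firstNodesBlock(String content) {
        // DOTALL lets '.' match line breaks; the lazy (.*?) stops at the first </nodes>.
        // "[^>]*" tolerates attributes on the opening tag.
        Pattern pattern = Pattern.compile("<nodes[^>]*>(.*?)</nodes>", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(content);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Made-up sample data; the real file may look different.
        String content = "<graph>\n<nodes count=\"1\">\n"
                + "  <node id=\"0\" label=\"Hello\"/>\n"
                + "</nodes>\n</graph>";
        // The question's pattern: \w+ cannot span whitespace, '<' or quotes,
        // and the literal "<nodes>" does not allow attributes.
        boolean original = Pattern.compile("<nodes>(\\w+)</nodes>").matcher(content).find();
        System.out.println(original);                 // false
        System.out.println(firstNodesBlock(content)); // <node id="0" label="Hello"/>
    }
}
```

Note that Pattern.MULTILINE, used in the question, only changes how ^ and $ behave; it is Pattern.DOTALL (or the inline (?s) flag) that lets the dot cross line breaks.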