Printing the content of a web page in Java
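I am trying to read the content of https://example.com/ using the HttpURLConnection class. I have removed the HTML tags between angle brackets, but I have not managed to remove the words between curly braces. Also, there are no spaces between the words that need to be printed.
Here is the code: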
URL url = new URL("https://example.com/");
Scanner sc = new Scanner(url.openStream());
StringBuffer sb = new StringBuffer();
while(sc.hasNext()) {
sb.append(sc.next());
}
String result = sb.toString();
//Removing the HTML tags
result = result.replaceAll("<[^>]*>", " ");
System.out.println("Contents of the web page: "+result);
This is the output I get:
Contents of the web page: ExampleDomain body{background-color:#f0f0f2;margin:0;padding:0;font-family:-apple-system,system-ui,BlinkMacSystemFont,"SegoeUI","OpenSans","HelveticaNeue",Helvetica,Arial,sans-serif;}div{width:600px;margin:5emauto;padding:2em;background-color:#fdfdff; border-radius:0.5em;box-shadow:2px3px7px2pxrgba(0,0,0,0.02);}a:link,a:visited{color:#38488f;text-decoration:none; }@media(max-width:700px){div{margin:0auto;width:auto;}} ExampleDomain Thisdomainisforuseinillustrativeexamplesindocuments.Youmayusethisdomaininliteraturewithoutpriorcoordinationoraskingforpermission. More information...
How do I remove the content between the curly braces?
And how do I put spaces between the words in the sentences?
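To remove the content between the curly braces, you can use String#replaceAll(String, String) (see the Javadoc):
str = str.replaceAll("\\{.*\\}", "");
This regular expression matches everything from the first opening brace to the last closing brace on a line, because .* is greedy. The braces are escaped so that they are matched literally, and each backslash is doubled inside a Java string literal. To get spaces between the words, prepend a space to each token as you append it. So your code would be:
URL url = new URL("https://example.com/");
Scanner sc = new Scanner(url.openStream());
StringBuffer sb = new StringBuffer();
while (sc.hasNext()) {
sb.append(" " + sc.next());
}
String result = sb.toString();
// Removing the HTML tags
result = result.replaceAll("<[^>]*>", "");
// Removing the CSS stuff
result = result.replaceAll("\\{.*\\}", "");
System.out.println("Contents of the web page: " + result);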
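The greedy brace pattern above works for this page, but it would also swallow any visible text that happened to sit between two separate brace pairs. As a rough sketch of an alternative (not the approach from the answer above), the whole <style> and <script> elements can be dropped before the remaining tags are stripped, and the whitespace collapsed afterwards. The class name PageText and the method strip are hypothetical names chosen for this illustration, and the sample HTML is made up so that the example runs without a network connection. Regular expressions are not a full HTML parser, so treat this as an approximation.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
class PageText {
    // (?is): case-insensitive, and '.' also matches line breaks.
    // Matches a whole <style>...</style> or <script>...</script> element.
    private static final Pattern STYLE_OR_SCRIPT =
            Pattern.compile("(?is)<(style|script)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag between angle brackets.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // One or more whitespace characters, including line breaks.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        // 1. Drop CSS and JavaScript together with their enclosing tags.
        String text = STYLE_OR_SCRIPT.matcher(html).replaceAll(" ");
        // 2. Replace every other tag with a space so adjacent words stay apart.
        text = TAG.matcher(text).replaceAll(" ");
        // 3. Collapse runs of whitespace into single spaces.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the downloaded page.
        String html = "<html><head><title>Example Domain</title>\n"
                + "<style>body { margin: 0; }\n div { width: 600px; }</style></head>\n"
                + "<body><h1>Example Domain</h1><p>This domain is for use.</p></body></html>";
        System.out.println(strip(html));
    }
}
```

To apply this to the real page, one option is to read the stream as a single token, for example with new Scanner(url.openStream()).useDelimiter("\\A"), so that the original spaces and line breaks are preserved before strip is called.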