New line character handling in Jsoup
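When I parse HTML with JSoup, a new line character inside a string of text is treated as if it were not there. Consider:
This string of text will wrap
here because of a new line character
When JSoup parses this string, it returns:
This string of text will wraphere because of a new line character
Note that the new line character does not even become a space. I just want it returned as a space. This is the text within a node. I have seen other solutions on Stack Overflow where people want or don't want a line break after a tag. That is not what I want. I simply want to know whether I can modify the parse function to return new line characters instead of ignoring them.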
Can you try getting the whole text, based on the answer here: Prevent Jsoup from discarding extra whitespace
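import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

/**
 * @param cell element that contains whitespace formatting
 * @return the unnormalized text of the first child if it is a text node,
 *         otherwise the element's normalized text
 */
public static String getText(Element cell) {
    String text = null;
    List<Node> childNodes = cell.childNodes();
    if (!childNodes.isEmpty()) {
        Node childNode = childNodes.get(0);
        if (childNode instanceof TextNode) {
            // getWholeText() keeps the original whitespace, including newlines
            text = ((TextNode) childNode).getWholeText();
        }
    }
    if (text == null) {
        // Fall back to the whitespace-normalized text
        text = cell.text();
    }
    return text;
}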
I figured it out. I made a mistake when fetching the HTML from the URL. I was using this method:
public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // Problem: readLine() strips the line terminator, so consecutive
            // lines are joined with nothing between them
            outputText += line;
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
When I should have been using the following:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

public static String getUrl(String url) {
    URL urlObj = null;
    try {
        urlObj = new URL(url);
    }
    catch (MalformedURLException e) {
        System.out.println("The url was malformed!");
        return "";
    }
    URLConnection urlCon = null;
    BufferedReader in = null;
    String outputText = "";
    try {
        urlCon = urlObj.openConnection();
        in = new BufferedReader(new InputStreamReader(urlCon.getInputStream()));
        String line = "";
        while ((line = in.readLine()) != null) {
            // Re-append the newline that readLine() removed
            outputText += line + "\n";
        }
        in.close();
    }
    catch (IOException e) {
        System.out.println("There was an error connecting to the URL");
        return "no";
    }
    return outputText;
}
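The problem had nothing to do with JSoup. I thought I would note it here because I copied this code from Instant Web Scraping with Java by Ryan Mitchell, and anyone else following that tutorial may run into the same problem.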
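For anyone who wants to see the effect in isolation, here is a minimal sketch that needs neither a network connection nor JSoup. It pushes a two-line string through a BufferedReader the same way getUrl does, once joining the lines with nothing and once with a newline. The class name ReadLineDemo and the helper readAll are hypothetical names made up for this illustration; they are not from the book or from JSoup.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReadLineDemo {

    // Reads every line from source and appends separator after each one.
    // (Hypothetical helper, written only to illustrate readLine() behavior.)
    public static String readAll(String source, String separator) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(source))) {
            String line;
            // readLine() returns each line WITHOUT its line terminator
            while ((line = in.readLine()) != null) {
                out.append(line).append(separator);
            }
        } catch (IOException e) {
            // Not expected for an in-memory StringReader
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "This string of text will wrap\nhere because of a new line character";

        // Empty separator reproduces the bug: "wrap" and "here" run together
        System.out.println(readAll(html, ""));

        // Newline separator restores the line break
        System.out.println(readAll(html, "\n"));
    }
}
```

With the empty separator the two lines are glued into "wraphere" before any parser sees them, which matches the symptom in the question. A StringBuilder is used here instead of += only because it avoids repeated string copying in the loop; the behavior is otherwise the same.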