How to get the result of BoilerPipe extraction as HTML instead of plain text
I am using the following code to extract text content from web pages. My app is hosted on Google App Engine and works exactly like the BoilerPipe Web API. The problem is that I can only get the result in plain text. I played around with the library looking for a workaround, but I couldn't find a method that returns the result as HTML. What I want is to offer an option such as HTML (extract mode), as the original BoilerPipe Web API does.
Here is the code I use to extract plain text:
    PrintWriter out = response.getWriter();
    try {
        String urlString = request.getParameter("url");
        String listOUtput = request.getParameter("OutputType");
        String listExtractor = request.getParameter("ExtractorType");
        URL url = new URL(urlString);
        switch (listExtractor) {
            case "1":
                String mainArticle = ArticleExtractor.INSTANCE.getText(url);
                out.println(mainArticle);
                break;
            case "2":
                String fullArticle = KeepEverythingExtractor.INSTANCE.getText(url);
                out.println(fullArticle);
                break;
        }
    } catch (BoilerpipeProcessingException e) {
        out.println("Sorry We Couldn't Scrape the URL you Entered " + e.getLocalizedMessage());
    } catch (IOException e) {
        out.println("Exception thrown");
    }
How can I add the ability to return the result as HTML?
I am working from the Boilerpipe source code. The following code solves your problem:
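    import java.net.URL;
    import org.xml.sax.InputSource;
    import de.l3s.boilerpipe.BoilerpipeExtractor;
    import de.l3s.boilerpipe.document.TextDocument;
    import de.l3s.boilerpipe.extractors.CommonExtractors;
    import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
    import de.l3s.boilerpipe.sax.HTMLDocument;
    import de.l3s.boilerpipe.sax.HTMLFetcher;
    import de.l3s.boilerpipe.sax.HTMLHighlighter;

    String urlString = "your url";
    URL url = new URL(urlString);
    // Fetch the page once; the same HTMLDocument feeds both passes below.
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
    final BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;
    // An "extracting" highlighter emits only the content blocks, as HTML.
    final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
    hh.setOutputHighlightOnly(true);

    // Pass 1: parse into a TextDocument and let the extractor mark the content blocks.
    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
    extractor.process(doc);

    // Pass 2: re-read the original HTML and keep only the marked blocks.
    final InputSource is = htmlDoc.toInputSource();
    final String text = hh.process(doc, is);
    System.out.println(text);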
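To wire this into the servlet from the question, one option is to branch on the OutputType parameter and send a matching Content-Type. Below is a minimal sketch of that dispatch, not a definitive implementation. The class name OutputModeDispatch, the Result type, and the parameter values "html" and "text" are all assumptions made for illustration. The two extractors are passed in as plain functions, so the sketch runs without Boilerpipe on the classpath. In the real servlet they would wrap ArticleExtractor.INSTANCE.getText(url) and the HTMLHighlighter code above.

```java
import java.util.function.Function;

/** Hypothetical helper: picks the plain-text or HTML extraction path. */
public class OutputModeDispatch {

    /** Body to write to the response, plus the Content-Type to send with it. */
    public static final class Result {
        public final String contentType;
        public final String body;

        Result(String contentType, String body) {
            this.contentType = contentType;
            this.body = body;
        }
    }

    /**
     * outputType is the raw request parameter; "html" selects the HTML path
     * (assumed value), anything else, including null, falls back to plain text.
     */
    public static Result render(String outputType, String url,
                                Function<String, String> textExtractor,
                                Function<String, String> htmlExtractor) {
        if ("html".equalsIgnoreCase(outputType)) {
            return new Result("text/html; charset=UTF-8", htmlExtractor.apply(url));
        }
        return new Result("text/plain; charset=UTF-8", textExtractor.apply(url));
    }

    public static void main(String[] args) {
        // Stand-ins for the real Boilerpipe calls (assumptions for this demo).
        Function<String, String> text = u -> "plain text of " + u;
        Function<String, String> html = u -> "<p>html of " + u + "</p>";

        Result r = render("html", "http://example.com", text, html);
        System.out.println(r.contentType);
        System.out.println(r.body);
    }
}
```

In the servlet you would then call response.setContentType(result.contentType) before writing result.body, so that browsers render the HTML mode as markup rather than showing the tags as text.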