How to get result of BoilerPipe extraction in HTML instead of plain text

Question

I am using the following code to extract text content from web pages. My app is hosted on Google App Engine and works exactly like the BoilerPipe Web API. The problem is that I can only get the result in plain text format. I played around with the library to find a workaround, but I couldn't find a method that returns the result as HTML. What I am trying to do is offer an option like HTML (extract mode), as in the original BoilerPipe Web API.

This is the code I use to extract plain text.

    PrintWriter out = response.getWriter();
    try {
        String urlString = request.getParameter("url");
        String listOUtput = request.getParameter("OutputType");
        String listExtractor = request.getParameter("ExtractorType");
        URL url = new URL(urlString);
        switch (listExtractor) {
            case "1":
                String mainArticle = ArticleExtractor.INSTANCE.getText(url);
                out.println(mainArticle);
                break;
            case "2":
                String fullArticle = KeepEverythingExtractor.INSTANCE.getText(url);
                out.println(fullArticle);
                break;
        }
    } catch (BoilerpipeProcessingException e) {
        out.println("Sorry We Couldn't Scrape the URL you Entered " + e.getLocalizedMessage());
    } catch (IOException e) {
        out.println("Exception thrown");
    }

How can I add the ability to return the result as HTML?

Answer 1

I am working from the Boilerpipe source code. The following code solves your problem:

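import java.net.URL;

import org.xml.sax.InputSource;

import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;
import de.l3s.boilerpipe.sax.HTMLHighlighter;

String urlString = "your url";
URL url = new URL(urlString);

// Fetch the raw HTML once so it can be parsed and then highlighted.
final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

final BoilerpipeExtractor extractor = CommonExtractors.DEFAULT_EXTRACTOR;

// An extracting instance outputs the extracted content as HTML.
final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
hh.setOutputHighlightOnly(true);

// Build the text document and let the extractor mark the content blocks.
final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
extractor.process(doc);

// Read the original HTML again and keep only the blocks marked as content.
final InputSource is = htmlDoc.toInputSource();
final String text = hh.process(doc, is);

System.out.println(text);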
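
To connect this to the servlet in the question, one option is to branch on the OutputType parameter and set a matching content type. The following is only a rough sketch under a few assumptions: the class OutputSelector, the Mode enum, the Extraction interface, and the parameter value "html" are all hypothetical names chosen for illustration (the question does not say which values OutputType takes), and the two extraction steps are passed in as callbacks so the sketch does not depend on Boilerpipe itself.

```java
import java.net.URL;
import java.util.Locale;

/** Hypothetical helper (not part of Boilerpipe): picks text or HTML output for one request. */
final class OutputSelector {

    /** Illustrative output modes; the names are assumptions, not a Boilerpipe API. */
    enum Mode { TEXT, HTML }

    /** Stand-in for an extraction step, which may throw checked exceptions. */
    interface Extraction {
        String apply(URL url) throws Exception;
    }

    /** Maps the raw OutputType parameter to a mode; anything but "html" falls back to TEXT. */
    static Mode parse(String param) {
        if (param == null) {
            return Mode.TEXT;
        }
        return "html".equals(param.trim().toLowerCase(Locale.ROOT)) ? Mode.HTML : Mode.TEXT;
    }

    /** Content type the servlet would set before writing the body. */
    static String contentType(Mode mode) {
        return mode == Mode.HTML ? "text/html; charset=UTF-8" : "text/plain; charset=UTF-8";
    }

    /** Runs the extraction callback that matches the requested mode. */
    static String render(Mode mode, URL url, Extraction toText, Extraction toHtml) throws Exception {
        return (mode == Mode.HTML ? toHtml : toText).apply(url);
    }
}
```

In the servlet, toText could be something like `u -> ArticleExtractor.INSTANCE.getText(u)` and toHtml could wrap the HTMLHighlighter code above, with `response.setContentType(OutputSelector.contentType(mode))` called before writing to `out`.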
