Java web crawler downloads too many GB of data
I have written a web crawler, but while crawling it downloads far too many gigabytes of data.
I only want to read the text (and avoid images, etc.).
I use Boilerpipe to extract the content from the HTML.
Here is the method I use to find the final redirected URL:
public String getFinalRedirectedUrl(String url) throws IOException {
    HttpURLConnection connection;
    String finalUrl = url;
    int redirectCount = 0;
    do {
        connection = (HttpURLConnection) new URL(finalUrl).openConnection();
        connection.setConnectTimeout(Config.HTTP_CONNECTION_TIMEOUT_TIME);
        connection.setReadTimeout(Config.HTTP_READ_TIMEOUT_TIME);
        connection.setInstanceFollowRedirects(false);
        connection.setUseCaches(false);
        connection.setRequestMethod("GET");
        connection.connect();
        int responseCode = connection.getResponseCode();
        if (responseCode >= 300 && responseCode < 400) {
            String redirectedUrl = connection.getHeaderField("Location");
            if (null == redirectedUrl)
                break;
            finalUrl = redirectedUrl;
            redirectCount++;
            if (redirectCount > Config.MAX_REDIRECT_COUNT) {
                throw new java.net.ProtocolException("Server redirected too many times (" + Config.MAX_REDIRECT_COUNT + ")");
            }
        } else {
            break;
        }
    } while (connection.getResponseCode() != HttpURLConnection.HTTP_OK);
    connection.disconnect();
    return finalUrl;
}
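Note that every hop in this loop issues a full GET, so the server may already start streaming the body before the connection is dropped. A lighter-weight variant (a sketch of my own, not part of the original post; it reuses the same Config constants, and the method name resolveFinalUrlWithHead is hypothetical) resolves redirects with HEAD requests, which transfer headers only:

// Sketch: resolve redirects without transferring response bodies. HEAD returns
// the same status code and Location header as GET, but no body, so each hop
// costs only a few hundred bytes. Some servers reject HEAD, so a GET fallback
// may still be needed in practice.
public String resolveFinalUrlWithHead(String url) throws IOException {
    String finalUrl = url;
    for (int hops = 0; hops <= Config.MAX_REDIRECT_COUNT; hops++) {
        HttpURLConnection conn = (HttpURLConnection) new URL(finalUrl).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(Config.HTTP_CONNECTION_TIMEOUT_TIME);
        conn.setReadTimeout(Config.HTTP_READ_TIMEOUT_TIME);
        conn.setRequestMethod("HEAD");          // headers only, no body
        int code = conn.getResponseCode();
        String location = conn.getHeaderField("Location");
        conn.disconnect();
        if (code < 300 || code >= 400 || location == null) {
            return finalUrl;                    // not a redirect: done
        }
        // Location may be relative; resolve it against the current URL
        finalUrl = new URL(new URL(finalUrl), location).toString();
    }
    throw new java.net.ProtocolException(
            "Server redirected too many times (" + Config.MAX_REDIRECT_COUNT + ")");
}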
This is how I fetch the URL:
private HTMLDocument fetch(URL url) throws IOException {
    final HttpURLConnection httpcon = (HttpURLConnection) url.openConnection();
    httpcon.setFollowRedirects(true);
    httpcon.setConnectTimeout(Config.HTTP_CONNECTION_TIMEOUT_TIME);
    httpcon.setReadTimeout(Config.HTTP_READ_TIMEOUT_TIME);
    httpcon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2");
    final String ct = httpcon.getContentType();
    Charset cs = Charset.forName("Cp1252");
    if (ct != null) {
        if (!ct.contains("text/html")) {
            System.err.println("Content type is:" + ct);
            return new HTMLDocument("");
        }
        Matcher m = PAT_CHARSET.matcher(ct);
        if (m.find()) {
            final String charset = m.group(1);
            try {
                cs = Charset.forName(charset);
            } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
                // keep default
            }
        }
    }
    InputStream in = httpcon.getInputStream();
    final String encoding = httpcon.getContentEncoding();
    if (encoding != null) {
        if ("gzip".equalsIgnoreCase(encoding)) {
            in = new GZIPInputStream(in);
        } else {
            System.err.println("WARN: unsupported Content-Encoding: " + encoding);
        }
    }
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int r;
    while ((r = in.read(buf)) != -1) {
        bos.write(buf, 0, r);
    }
    in.close();
    final byte[] data = bos.toByteArray();
    return new HTMLDocument(data, cs);
}
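This method buffers the entire response into memory no matter how large it is, which is one place the gigabytes add up. One way to bound that (a sketch of my own, not part of the original code; MAX_HTML_BYTES and readBounded are hypothetical names) is to check Content-Length up front and stop reading once a fixed byte budget is spent:

// Sketch: read at most MAX_HTML_BYTES from the response stream.
private static final int MAX_HTML_BYTES = 1 << 20; // 1 MB budget, adjust as needed

private byte[] readBounded(HttpURLConnection httpcon, InputStream in) throws IOException {
    // If the server declares a huge body up front, skip it entirely
    long declared = httpcon.getContentLengthLong();
    if (declared > MAX_HTML_BYTES) {
        return new byte[0];
    }
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int r;
    while ((r = in.read(buf)) != -1) {
        bos.write(buf, 0, r);
        if (bos.size() >= MAX_HTML_BYTES) {   // stop once the budget is spent
            break;
        }
    }
    return bos.toByteArray();
}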
I then pass the result to Boilerpipe to extract the body text:
HTMLDocument htmlDoc = fetch(new URL(url));
String body = ArticleExtractor.INSTANCE.getText(htmlDoc.toInputSource());
How can I reduce the amount of data that gets downloaded?
Using JSoup reduced the gigabytes downloaded and improved efficiency:
public HashMap<String, String> fetchWithJsoup(String url, String iniUrl, int redirCount)
        throws IOException {
    HashMap<String, String> returnObj = new HashMap<>();
    Connection con;
    try {
        con = Jsoup.connect(url);
    } catch (IllegalArgumentException ex) {
        if (ex.getMessage().contains("Malformed URL")) {
            System.err.println("Malformed URL:: "
                    + ex.getClass().getName() + ": " + ex.getMessage() + " > " + iniUrl);
        } else {
            Logger.getLogger(ContentGetter.class.getName()).log(Level.SEVERE, null, ex);
        }
        returnObj.put(RETURN_FINAL_URL, url);
        returnObj.put(RETURN_BODY, "");
        return returnObj;
    }
    con.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2");
    con.timeout(Config.HTTP_CONNECTION_TIMEOUT_TIME);
    Document doc = con.get();
    String uri = doc.baseUri();
    returnObj.put(RETURN_FINAL_URL, uri);
    Elements redirEle = doc.head().select("meta[http-equiv=refresh]");
    if (redirEle.size() > 0) {
        String content = redirEle.get(0).attr("content");
        Pattern pattern = Pattern.compile("^.*URL=(.+)$", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(content);
        if (matcher.matches() && matcher.groupCount() > 0) {
            String redirectUrl = matcher.group(1);
            if (redirectUrl.startsWith("'")) {
                /* removes single quotes around URLs wrapped in single quotes */
                redirectUrl = redirectUrl.replaceAll("(^')|('$)", "");
            }
            if (redirectUrl.startsWith("/")) {
                String[] splitedUrl = url.split("/");
                redirectUrl = splitedUrl[0] + "//" + splitedUrl[2] + redirectUrl;
            }
            if (!redirectUrl.equals(url)) {
                redirCount++;
                if (redirCount < Config.MAX_REDIRECT_COUNT) {
                    return fetchWithJsoup(redirectUrl, iniUrl, redirCount);
                }
            }
        }
    }
    HTMLDocument htmlDoc = new HTMLDocument(doc.html());
    String body = "";
    try {
        if (htmlDoc != null) {
            body = ArticleExtractor.INSTANCE.getText(htmlDoc.toInputSource());
        }
    } catch (OutOfMemoryError ex) {
        System.err.println("OutOfMemoryError while extracting text !!!!!!!!");
        System.gc();
    } catch (BoilerpipeProcessingException ex) {
        Logger.getLogger(ContentGetter.class.getName()).log(Level.SEVERE, null, ex);
    }
    returnObj.put(RETURN_BODY, body);
    return returnObj;
}
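Part of why jsoup downloads so much less is that it only fetches the single HTML resource (never images, CSS or scripts) and its Connection truncates the response body at a default cap (maxBodySize, roughly 1-2 MB depending on the release). The cap can also be set explicitly; a small sketch (MAX_BODY_BYTES is a hypothetical constant, not from the original post):

// Sketch: make the body cap explicit rather than relying on jsoup's default.
Document doc = Jsoup.connect(url)
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2")
        .timeout(Config.HTTP_CONNECTION_TIMEOUT_TIME)
        .maxBodySize(MAX_BODY_BYTES)   // truncate anything larger; 0 means unlimited
        .get();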