URLFetchService using GAE returns null when trying to fetch New York Times page

Question

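I am using the code below to fetch the HTML of a New York Times page, but unfortunately it returns null. I have tried other websites (CNN, The Guardian, etc.) and they work fine. I am using URLFetchService from Google App Engine.

Here is the code snippet. Please tell me what I am doing wrong.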

//url = https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html

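private String extractFromUrl(String url, boolean forced) throws java.io.IOException, org.xml.sax.SAXException,
                      de.l3s.boilerpipe.BoilerpipeProcessingException  {

    Future<HTTPResponse> urlFuture = getMultiResponse(url);

    HTTPResponse urlResponse = null;
    try {
        // For the NYT URL this throws ExecutionException, so urlResponse stays null
        urlResponse = urlFuture.get();
    } catch ( InterruptedException ie ) {
        Thread.currentThread().interrupt(); // preserve the interrupt status
        ie.printStackTrace();
    } catch ( ExecutionException ee ) {
        ee.printStackTrace(); // prints the trace shown below
    }

    if (urlResponse == null) {
        // The fetch failed; return null instead of hitting a NullPointerException on getContent()
        return null;
    }

    // Decode explicitly rather than with the platform default charset (assumes the page is UTF-8)
    return new String(urlResponse.getContent(), java.nio.charset.StandardCharsets.UTF_8);
}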

public Future<HTTPResponse> getMultiResponse(String website) {
    URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
    URL url;
    try {
        url = new URL(website);
    } catch (MalformedURLException e) {
        // Fail fast instead of building a request around a null URL
        throw new IllegalArgumentException("Malformed URL: " + website, e);
    }

    // URL Fetch follows redirects on its own with this option
    FetchOptions fetchOptions = FetchOptions.Builder.followRedirects();
    HTTPRequest request = new HTTPRequest(url, HTTPMethod.GET, fetchOptions);
    // Asynchronous call; the caller blocks on Future.get()
    return fetcher.fetchAsync(request);
}

The exception I get is this:

java.util.concurrent.ExecutionException: java.io.IOException: Could not fetch URL: https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html, error: Received exception executing http method GET against URL https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html: null
[INFO]  at com.google.appengine.api.utils.FutureWrapper.setExceptionResult(FutureWrapper.java:66)
[INFO]  at com.google.appengine.api.utils.FutureWrapper.get(FutureWrapper.java:97)
[INFO]  at main.java.com.myapp.app.MyServlet.extractFromUrl(MyServlet.java:10)

Answer 1

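Looking at curl's verbose output, you can see that the site tries to set a cookie and redirects you again when the cookie is not accepted.

It seems the Times redirects you 7 times before giving up: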

$ curl --verbose -L "https://www.nytimes.com/2017/05/02/us/politics/health-care-paul-ryan-fred-upton-congress.html" 2>&1 | grep 303 | wc -l
7

The maximum number of redirects for UrlFetch appears to be 5 [0], so the fetch fails before the chain of 7 redirects ends.

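To scrape www.nytimes.com successfully, you will have to disable following redirects and handle the cookie logic yourself. Some inspiration can be found here [1] and here [2].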
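A minimal sketch of that idea, not a drop-in implementation: follow each redirect by hand and send back whatever cookies the site has set so far. All names here (`Hop`, `SingleFetcher`, `CookieRedirects`, `fetchFollowing`) are hypothetical and made up for illustration. The actual HTTP call is hidden behind the `SingleFetcher` interface so the loop can be shown on its own. The cookie handling is deliberately naive: it keeps only the `name=value` part of each `Set-Cookie` header and ignores attributes such as `Path`, `Domain`, and `Expires`.

```java
import java.io.IOException;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One HTTP response, reduced to what the redirect loop needs (hypothetical type). */
final class Hop {
    final int status;
    final String location;         // Location header value, or null if absent
    final List<String> setCookies; // raw Set-Cookie header values
    final byte[] body;

    Hop(int status, String location, List<String> setCookies, byte[] body) {
        this.status = status;
        this.location = location;
        this.setCookies = setCookies;
        this.body = body;
    }
}

/** Hypothetical seam: performs ONE request and must not follow redirects itself. */
interface SingleFetcher {
    Hop fetch(String url, String cookieHeader) throws IOException;
}

final class CookieRedirects {
    /** Stores the name=value part of each Set-Cookie header; attributes are ignored. */
    static void remember(Map<String, String> jar, List<String> setCookies) {
        for (String header : setCookies) {
            String pair = header.split(";", 2)[0].trim();
            int eq = pair.indexOf('=');
            if (eq > 0) {
                jar.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
        }
    }

    /** Builds the value of the Cookie request header from the jar. */
    static String cookieHeader(Map<String, String> jar) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : jar.entrySet()) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    /** Follows up to maxHops redirects, replaying collected cookies on every request. */
    static Hop fetchFollowing(SingleFetcher fetcher, String startUrl, int maxHops) throws IOException {
        Map<String, String> jar = new LinkedHashMap<>();
        String url = startUrl;
        for (int hop = 0; hop <= maxHops; hop++) {
            Hop response = fetcher.fetch(url, cookieHeader(jar));
            remember(jar, response.setCookies);
            boolean redirect = response.status >= 300 && response.status < 400
                    && response.location != null;
            if (!redirect) {
                return response;
            }
            // Location may be relative, so resolve it against the current URL
            url = URI.create(url).resolve(response.location).toString();
        }
        throw new IOException("Gave up after " + maxHops + " redirects starting at " + startUrl);
    }
}
```

On App Engine, a `SingleFetcher` could presumably be written with the same `URLFetchService` as in the question: build the request with `FetchOptions.Builder.doNotFollowRedirects()`, add the cookies with `request.setHeader(new HTTPHeader("Cookie", cookieHeader))`, and read `Location` and `Set-Cookie` from `response.getHeaders()`. That wiring is left out above and is untested here.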

[0] https://groups.google.com/forum/#!topic/google-appengine/F2dX3LqOrhY

[1] https://groups.google.com/d/msg/google-appengine-java/pE0xak7LRxg/M__U-SM3YMMJ

[2]
