Java 8 CompletableFuture web crawler doesn't crawl past one URL
I am practicing the concurrency features newly introduced in Java 8, working through an exercise from Cay S. Horstmann's book "Java SE 8 for the Really Impatient". I built the following web crawler using the new CompletableFuture and jsoup. The basic idea is that, given a URL, it first finds m URLs on that page and then repeats the process to a depth of n; m and n are parameters, of course. The problem is that the program fetches the URLs of the initial page but never recurses. What am I missing?
// Static imports assumed by this snippet:
//   java.util.concurrent.CompletableFuture.allOf / supplyAsync,
//   java.util.stream.Stream.of, java.util.stream.Collectors.toSet,
//   org.jsoup.Jsoup.connect
static class WebCrawler {

    CompletableFuture<Void> crawl(final String startingUrl,
            final int depth, final int breadth) {
        if (depth <= 0) {
            return completedFuture(startingUrl, depth);
        }
        final CompletableFuture<Void> allDoneFuture = allOf(
                (CompletableFuture[]) of(startingUrl)
                        .map(url -> supplyAsync(getContent(url)))
                        .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
                        .map(urlsFuture -> urlsFuture.thenApply(doForEach(depth, breadth)))
                        .toArray(size -> new CompletableFuture[size]));
        allDoneFuture.join();
        return allDoneFuture;
    }

    // Logs the link and returns an already-completed future;
    // terminates the recursion once the depth is exhausted.
    private CompletableFuture<Void> completedFuture(
            final String startingUrl, final int depth) {
        LOGGER.info("Link: {}, depth: {}.", startingUrl, depth);
        CompletableFuture<Void> future = new CompletableFuture<>();
        future.complete(null);
        return future;
    }

    // Fetches the document behind the URL with jsoup.
    private Supplier<Document> getContent(final String url) {
        return () -> {
            try {
                return connect(url).get();
            } catch (IOException e) {
                throw new UncheckedIOException(
                        "Something went wrong trying to fetch the contents of the URL: "
                                + url, e);
            }
        };
    }

    // Extracts at most `limit` absolute link targets from the document.
    private Function<Document, Set<String>> getURLs(final int limit) {
        return doc -> {
            LOGGER.info("Getting URLs for document: {}.", doc.baseUri());
            return doc.select("a[href]").stream()
                    .map(link -> link.attr("abs:href"))
                    .limit(limit)
                    .peek(LOGGER::info)
                    .collect(toSet());
        };
    }

    // Recursively crawls each extracted URL one level deeper.
    private Function<Set<String>, Stream<CompletableFuture<Void>>> doForEach(
            final int depth, final int breadth) {
        return urls -> urls.stream()
                .map(url -> crawl(url, depth - 1, breadth));
    }
}
Test case:
@Test
public void testCrawl() {
    new WebCrawler().crawl(
            "http://en.wikipedia.org/wiki/Java_%28programming_language%29",
            2, 10);
}
Answer: The problem is in this code:
final CompletableFuture<Void> allDoneFuture = allOf(
        (CompletableFuture[]) of(startingUrl)
                .map(url -> supplyAsync(getContent(url)))
                .map(docFuture -> docFuture.thenApply(getURLs(breadth)))
                .map(urlsFuture -> urlsFuture.thenApply(doForEach(depth, breadth)))
                .toArray(size -> new CompletableFuture[size]));
For some reason you are doing all of this in a one-element stream (is that part of the exercise?). The result is that allDoneFuture does not track the completion of the subtasks. It tracks the completion of the Stream<CompletableFuture> that comes out of doForEach. But that stream is ready immediately, and the futures inside it are never asked to complete.
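To see why, it helps to spell out the type the one-element stream pipeline actually builds (a minimal sketch of what happens inside crawl; the variable name outer is illustrative, not from the original code):

// The single element flowing through the stream is a future whose *value*
// is the stream of child futures, not the child futures themselves:
CompletableFuture<Stream<CompletableFuture<Void>>> outer =
        supplyAsync(getContent(startingUrl))
                .thenApply(getURLs(breadth))
                .thenApply(doForEach(depth, breadth));

// allOf therefore waits only on `outer`, which completes as soon as the
// lazy stream object exists. Nothing ever consumes that stream, so the
// recursive crawl(url, depth - 1, breadth) calls are never even executed.
CompletableFuture<Void> allDoneFuture = allOf(outer);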
Fix it by removing the stream, which isn't helping at all:
final CompletableFuture<Void> allDoneFuture = supplyAsync(getContent(startingUrl))
        .thenApply(getURLs(breadth))
        .thenApply(doForEach(depth, breadth))
        .thenApply(futures -> futures.toArray(CompletableFuture[]::new))
        .thenCompose(CompletableFuture::allOf);
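Two things change here: toArray is a terminal operation that consumes the lazy stream, which is what actually triggers the recursive crawl calls, and thenCompose flattens the nested future so that allOf now tracks the child futures themselves. The returned future therefore completes only once the entire crawl tree is done, which a caller can observe by blocking on it (a minimal usage sketch reusing the test's URL):

new WebCrawler()
        .crawl("http://en.wikipedia.org/wiki/Java_%28programming_language%29", 2, 10)
        .join(); // returns only after every page down to depth 2 has been crawled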