How should I modify my code to parse Google News search article titles, previews, and URLs?
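I want to parse Google News search results: 1) the article title, 2) the preview, 3) the URL.
To do this, I need to adapt my code to the structure of the results page.
Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");
Mainly this part:
( ".g>.r>.a")
How should I modify it?
Full code: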
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    String google = "http://www.google.com/search?q=";
    String search = "Whosebug";
    String charset = "UTF-8";
    String news = "&tbm=nws";
    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
    Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset) + news).userAgent(userAgent).get().select(".g>.r>.a");
    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
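The substring call in the code above relies on the first '=' and the first '&' in the link framing the target URL. As a rough sketch only, here is a slightly more defensive way to pull the q parameter out of a redirect link of the format described in the comment, using nothing but the JDK. The class name RedirectUrl, the method name extractTarget, and the sample link are made up for illustration, and the Charset overload of URLDecoder.decode assumes Java 10 or later.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class RedirectUrl {
    // Hypothetical helper: returns the decoded value of the "q" query
    // parameter, or null if the link has no query or no "q" parameter.
    static String extractTarget(String redirectUrl) {
        String query = URI.create(redirectUrl).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("q")) {
                // Decode only the value, so an encoded '&' or '=' inside it survives.
                return URLDecoder.decode(pair.substring(eq + 1), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample in the "/url?q=<url>&sa=U&ei=<someKey>" format.
        String sample = "http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1&sa=U&ei=someKey";
        System.out.println(extractTarget(sample)); // prints https://example.com/a?b=1
    }
}
```

Looking the parameter up by name means the result does not depend on q being the first parameter in the link.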
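Update
How to select the right elements (using Chrome)
First step: disable JavaScript in your browser (for convenience, with a plugin such as uMatrix), so that you see the same result jsoup does.
Now right-click an element and choose Inspect, or open the developer tools with Ctrl+Shift+I. When you hover over the source in the Elements tab, the corresponding element is highlighted in the rendered page. Right-clicking an element in the source offers Copy -> Copy selector. This is a good starting point, but it is sometimes too strict. Here it gives the selector #rso > div:nth-child(3)
That is a div that is the third direct child of the element with the id rso. That is too specific, so we generalize:
We select all direct child divs of the element with the id rso: #rso > div
Then we grab the title anchor with h3 > a; its text node and its href attribute give the title and the URL.
Next we use the class st (div.st) to get the inner div, which contains the preview in its text node. If that div is missing, we skip the element.
By using .data("key","value") in the request, we do not need to encode the parameters manually.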
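To make the last point concrete, this is roughly what the manual encoding would look like if the query string were assembled by hand with the JDK alone. It is only an illustrative sketch: the class name QueryBuilder, the method name buildQuery, and the sample search term are made up, the Charset overload of URLEncoder.encode assumes Java 10 or later, and the string jsoup produces internally is not guaranteed to match this one character for character.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryBuilder {
    // Hypothetical helper: joins key/value pairs into an encoded query
    // string, keeping the iteration order of the given map.
    static String buildQuery(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "stack overflow & jsoup"); // spaces and '&' must be escaped
        params.put("tbm", "nws");                  // news search, as in the question
        System.out.println("https://www.google.com/search?" + buildQuery(params));
        // prints https://www.google.com/search?q=stack+overflow+%26+jsoup&tbm=nws
    }
}
```

Passing the raw values to .data(...) avoids carrying a helper like this around.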
Example code
import java.io.IOException;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The statements below belong inside a method, e.g. main.
String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String searchTerm = "Whosebug";
int numberOfResultpages = 2; // grabs first two pages of search results
String searchUrl = "https://www.google.com/search?";
Document doc;
for (int i = 0; i < numberOfResultpages; i++) {
    try {
        doc = Jsoup.connect(searchUrl)
                .userAgent(userAgent)
                .data("q", searchTerm)
                .data("tbm", "nws")
                // "start" is a result offset, not a page index; assuming 10 results per page
                .data("start", String.valueOf(i * 10))
                .method(Method.GET)
                .referrer("https://www.google.com/").get();
        for (Element result : doc.select("#rso > div")) {
            Element previewDiv = result.select("div.st").first();
            Element h3a = result.select("h3 > a").first();
            // skip entries that lack a preview div or a title link
            if (previewDiv == null || h3a == null) continue;
            String title = h3a.text();
            String url = h3a.attr("href");
            String preview = previewDiv.text();
            // just printing out title, link and preview to demonstrate the approach
            System.out.println(title + " -> " + url + "\n\t" + preview);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Output
Stack Overflow: Movie Magic -> https://geekdad.com/2016/09/stack-overflow-movie-magic-2/
I got to visit the set of Kubo and the Two Strings and see some of the amazing work that went into creating the film. But well before the ...
Will Whosebug Documentation Realize Its Lofty Goal? -> https://dzone.com/articles/will-Whosebug-documentation-realize-its-lofty
With the Whosebug Documentation project now in beta, how close is it to realizing the lofty goals it has set forth for itself? Can it ever ...
Stack Overflow: Progress Report -> https://geekdad.com/2016/09/stack-overflow-progress-report/
Of the books on my list, the only one I totally finished so far is Kidding Ourselves, which I included in this Stack Overflow. And that perhaps is an ...
....