(Java) Get the first n Google results as links
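First of all, I looked for similar questions but did not find the answer I need, so please forgive me if this question is not unique and new.
I want to get the first N (maybe 5 or 10) Google results as links.
At the moment I have something like this: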
String url="http://www.google.com/search?q=";
String charset="UTF-8";
String key="java";
String query = String.format("%s",URLEncoder.encode(key, charset));
URLConnection con = new URL(url+ query).openConnection();
//next line is to trick Google who is blocking the default UserAgent
con.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
This gives me the complete Google HTML for the search, but I only want the raw links of the first n results. How can I do that?
Thanks in advance.
I did some digging in the HTML; you have to search the string for:
<h3 class="r"><a href="/url?q=
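The link follows that and runs up to the double quote. I'll put together a script as soon as I can.
Edit
This should get the first n links when searching Google for the string key: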
// Needs java.io.*, java.net.*, java.util.ArrayList and java.util.List
public static String[] getLinks(String key, int n) throws MalformedURLException, IOException {
    String url = "http://www.google.com/search?q=";
    String charset = "UTF-8";
    String query = URLEncoder.encode(key, charset);
    URLConnection con = new URL(url + query).openConnection();
    con.setRequestProperty("User-Agent",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
    // Read the whole response into one string
    StringBuilder wholeThing = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream(), charset))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null) wholeThing.append(inputLine);
    }
    String html = wholeThing.toString();
    List<String> strings = new ArrayList<String>();
    String search = "<h3 class=\"r\"><a href=\"/url?q=";
    int start = html.indexOf(search);
    while (strings.size() < n && start >= 0) {
        start += search.length();
        // The link ends at the first '&', where the next query parameter begins
        int end = html.indexOf('&', start);
        if (end < 0) break;
        strings.add(html.substring(start, end));
        start = html.indexOf(search, end);
    }
    return strings.toArray(new String[0]);
}
Make sure to import java.util.List, not java.awt.List!
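The string search used by getLinks can also be tried without touching the network. Below is a minimal offline sketch, assuming the result markup still looks like the fragment quoted above (Google's HTML changes over time, so that may not hold). The class name LinkExtractor and the method name extractLinks are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LinkExtractor {
    // Marker assumed to precede every result link (taken from the fragment quoted above).
    private static final String MARKER = "<h3 class=\"r\"><a href=\"/url?q=";

    /** Returns up to n result links found in the given HTML, in page order. */
    static List<String> extractLinks(String html, int n) {
        List<String> links = new ArrayList<>();
        int start = html.indexOf(MARKER);
        while (links.size() < n && start >= 0) {
            start += MARKER.length();
            // The link runs up to the next '&' or the closing quote, whichever comes first.
            int amp = html.indexOf('&', start);
            int quote = html.indexOf('"', start);
            int end = (amp >= 0 && amp < quote) ? amp : quote;
            if (end < 0) break; // truncated markup: give up
            links.add(html.substring(start, end));
            start = html.indexOf(MARKER, end);
        }
        return links;
    }
}
```

Calling LinkExtractor.extractLinks(html, 5) on a saved copy of a results page returns at most five links, and fewer if the page contains fewer matches.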
You may want to try the jsoup library, since it takes much of the effort out of parsing web pages:
Elements links = Jsoup.connect("https://www.google.com.au/search?q=fred")
.get().select("h3.r").select("a");
for (Element link : links)
System.out.println(link);
Elements extends ArrayList<Element>, so you can access the first n elements with:
// Math.min guards against pages that return fewer than n results
for (int i = 0; i < Math.min(n, links.size()); i++)
    System.out.println(links.get(i));
Or, using streams:
links.stream().limit(n)...
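One practical difference between the two: an indexed loop needs a bounds check when the page returns fewer than n results, whereas limit(n) simply stops early. A small illustration with a plain List standing in for jsoup's Elements (the class FirstN and the method firstN are hypothetical names):

```java
import java.util.List;
import java.util.stream.Collectors;

class FirstN {
    /** Returns at most the first n items; safe when the list is shorter than n. */
    static <T> List<T> firstN(List<T> items, int n) {
        return items.stream().limit(n).collect(Collectors.toList());
    }
}
```

For a three-element list, firstN(list, 2) yields the first two items and firstN(list, 10) yields all three without throwing.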
If you only want the raw URL:
link.attr("href")
Putting it all together, the following prints the first 5 raw links for the Google search term "fred":
Jsoup.connect("https://www.google.com.au/search?q=fred").get()
.select("h3.r").select("a")
.stream()
.limit(5)
.map(l -> l.attr("href"))
.forEach(System.out::println);
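Judging from the markup quoted earlier in the thread, the href values may come back as Google redirect paths of the form /url?q=<target>&... rather than the target address itself. Here is a minimal sketch for unwrapping them with the standard library, assuming that form. RedirectUnwrapper and unwrap are hypothetical names, not part of jsoup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RedirectUnwrapper {
    private static final String PREFIX = "/url?q=";

    /** Returns the decoded q parameter of a "/url?q=..." href, or the href unchanged if it has another form. */
    static String unwrap(String href) {
        if (!href.startsWith(PREFIX)) return href;
        // The target ends at the next '&' (start of the following parameter), if there is one.
        int end = href.indexOf('&', PREFIX.length());
        String target = end < 0 ? href.substring(PREFIX.length()) : href.substring(PREFIX.length(), end);
        try {
            return URLDecoder.decode(target, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

In the stream above this would slot in as an extra .map(RedirectUnwrapper::unwrap) step after the attr("href") mapping.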