Find most frequent words on a webpage (using Jsoup)?
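For my project I have to count the most frequent words in a Wikipedia article. I found Jsoup for parsing the HTML, but that still leaves the word-frequency problem. Is there a function in Jsoup that counts word frequency, or any way to find the most frequent words on a web page using Jsoup?
Thanks.
Yes, you can use Jsoup to get the text from the web page, like this: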
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
String text = doc.body().text();
Then you need to count the words and find the most frequent ones. This code looks promising. It needs to be modified to work with Jsoup's String output, like this:
import java.io.*;
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupWordCount {

    public static void main(String[] args) throws IOException {
        long time = System.currentTimeMillis();

        Map<String, Word> countMap = new HashMap<String, Word>();

        // Connect to Wikipedia and get the HTML
        System.out.println("Downloading page...");
        Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

        // Get the actual text from the page, excluding the HTML
        String text = doc.body().text();

        System.out.println("Analyzing text...");

        // Read the text through a StringReader so that no byte/charset
        // round trip (and no platform default charset) is involved
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String line;
        while ((line = reader.readLine()) != null) {
            // Split on every run of characters outside the accepted letter set
            String[] words = line.split("[^A-ZÅÄÖa-zåäö]+");
            for (String word : words) {
                if ("".equals(word)) {
                    continue;
                }
                Word wordObj = countMap.get(word);
                if (wordObj == null) {
                    wordObj = new Word();
                    wordObj.word = word;
                    wordObj.count = 0;
                    countMap.put(word, wordObj);
                }
                wordObj.count++;
            }
        }
        reader.close();

        // Sort by descending count (see Word.compareTo)
        SortedSet<Word> sortedWords = new TreeSet<Word>(countMap.values());

        int shown = 0;
        int maxWordsToDisplay = 10; // number of words to show the frequency for
        // Words to skip; the match is case-sensitive, so "The" is not skipped
        List<String> wordsToIgnore = Arrays.asList("the", "and", "a");
        for (Word word : sortedWords) {
            if (shown >= maxWordsToDisplay) {
                break;
            }
            if (wordsToIgnore.contains(word.word)) {
                continue;
            }
            System.out.println(word.count + "\t" + word.word);
            shown++;
        }

        time = System.currentTimeMillis() - time;
        System.out.println("Finished in " + time + " ms");
    }

    public static class Word implements Comparable<Word> {
        String word;
        int count;

        @Override
        public int hashCode() { return word.hashCode(); }

        @Override
        public boolean equals(Object obj) {
            return obj instanceof Word && word.equals(((Word) obj).word);
        }

        // Higher counts first; ties are broken by the word itself so that the
        // TreeSet does not drop words that happen to share a count
        @Override
        public int compareTo(Word b) {
            int byCount = Integer.compare(b.count, count);
            return byCount != 0 ? byCount : word.compareTo(b.word);
        }
    }
}
Output from one run (the page is live, so your words and counts will differ):
Downloading page...
Analyzing text...
42 of
24 in
20 Wikipedia
19 to
16 is
11 that
10 The
9 was
8 articles
7 featured
Finished in 3300 ms
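One thing to watch when sorting the counts in a TreeSet: a comparison that looks only at the count treats two words with the same count as the same element, so only one of them is kept. The sketch below is not part of the program above. It uses a hypothetical class TieBreakDemo with made-up counts to show the effect, and how breaking ties on the word itself avoids it.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeSet;

public class TieBreakDemo {

    // Orders entries by descending count only. Entries with equal counts
    // compare as 0, so a TreeSet regards them as duplicates.
    public static final Comparator<Map.Entry<String, Integer>> BY_COUNT_ONLY =
            (a, b) -> Integer.compare(b.getValue(), a.getValue());

    // Same ordering, but equal counts fall back to comparing the words,
    // so two different words never compare as 0.
    public static final Comparator<Map.Entry<String, Integer>> BY_COUNT_THEN_WORD =
            (a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            };

    // Puts all entries into a TreeSet with the given order and reports
    // how many of them survive.
    public static int sizeAfterSorting(Map<String, Integer> counts,
                                       Comparator<Map.Entry<String, Integer>> order) {
        TreeSet<Map.Entry<String, Integer>> sorted = new TreeSet<>(order);
        sorted.addAll(counts.entrySet());
        return sorted.size();
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("was", 9);
        counts.put("are", 9); // same count as "was"
        counts.put("of", 42);

        System.out.println(sizeAfterSorting(counts, BY_COUNT_ONLY));      // 2
        System.out.println(sizeAfterSorting(counts, BY_COUNT_THEN_WORD)); // 3
    }
}
```

If every count in your top ten is different, as in the run above, it is worth checking whether words with equal counts are being lost this way.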
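Some notes:
This code can ignore some words, such as "the", "and" and "a"; you will need to customize that list.
It seems there are sometimes problems with Unicode characters. I have not run into them myself, but someone in the comments did.
This could be done better and with less code.
Not well tested.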
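As a rough idea of what "less code" could look like, here is a minimal sketch using Java 8 streams. It is an alternative, not the code above: the class TopWords and the method topWords are hypothetical names, words are lower-cased before counting (so "The" and "the" are merged and both ignored), and the split uses \p{L} to accept any Unicode letter instead of a fixed letter set. It works on a plain String, so you would pass it the result of doc.body().text().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class TopWords {

    // Returns up to `limit` (word, count) entries, most frequent first.
    // Words are lower-cased before counting; ties are ordered alphabetically.
    public static List<Map.Entry<String, Long>> topWords(String text, Set<String> ignore, int limit) {
        Map<String, Long> counts = Arrays.stream(text.split("[^\\p{L}]+"))
                .filter(w -> !w.isEmpty())               // split can yield a leading empty string
                .map(w -> w.toLowerCase(Locale.ROOT))
                .filter(w -> !ignore.contains(w))        // drop the words to ignore
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample sentence for illustration; replace with doc.body().text()
        String text = "The cat and the dog saw a cat chase the other cat";
        Set<String> ignore = new HashSet<>(Arrays.asList("the", "and", "a"));
        for (Map.Entry<String, Long> e : topWords(text, ignore, 3)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Run as is, main prints cat with a count of 3, then chase and dog with a count of 1 each. Whether lower-casing is what you want depends on the project; drop the map step to keep the case-sensitive behaviour.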
Enjoy!