在 Trie 树中找到最频繁出现的长度为 n 的前缀
Find the most frequent prefix of length n in Trie
我正在使用 Trie 数据结构并试图找到:
- 长度为n的最频繁前缀
- 长度为n或更多的最常见前缀
有一个类似的 post 但没有提供代码。上述 post 中建议的可能解决方案,我无法正确实施:
为每个 trie 节点添加一个计数器,并在使用该节点时递增该值,然后扫描节点以检查最频繁的前缀在哪里。
我可以统计每个节点的使用情况,但我不知道如何检查哪个前缀是最常用的,然后如何从节点中构建这个前缀。问题:如何根据计数器找到长度为n(或> = n)的最频繁前缀以及如何打印这样的前缀?
我的 trie 有数千个单词,但出于演示目的,我们可以坚持使用一组单词:
Trie t = new Trie();
t.insert("test");
t.insert("testify");
t.insert("television");
t.insert("car");
t.insert("cat");
t.insert("cabin");
t.insert("cab");
findMostFrequentPrefix(t.getRoot(), 2);
// expected output for provided data set:
// 1) returns "ca" for exact length of 2 // "ca" appears 4 times and "te" appears 3 times
// 2) (not exactly sure about this) returns "ca" (appears 4 times, the second most frequent "cab" appears only 2 times) for length 2 or more, however if there was an additional most frequent prefix e.g. "cad" (5 times appearance), the function should return this one, instead of "ca"
Trie 代码,Trie 节点:
public class TrieNode {
TrieNode[] children;
boolean isEnd;
char value;
int counter;
public TrieNode() {
this.children = new TrieNode[26];
}
public TrieNode getNode(char ch) {
if (this.children.length == 0) {
return null;
}
for (TrieNode child : this.children) {
if (child != null)
if (child.value == ch)
return child;
}
return null;
}
public TrieNode[] getChildren() {
return children;
}
public char getValue() {
return value;
}
}
public class Trie {
private TrieNode root;
public Trie() {
root = new TrieNode();
}
public TrieNode getRoot() {
return this.root;
}
// Inserts a word into the trie.
public void insert(String word) {
TrieNode p = root;
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
int index = c - 'a';
// p.counter++ incrementing the value which later will be used to examine the frequency
if (p.children[index] == null) {
TrieNode temp = new TrieNode();
p.children[index] = temp;
p.children[index].value = c;
p.counter++;
System.out.println("Counter: " + p.counter + " - for " + p.value);
p = temp;
} else {
p.counter++;
System.out.println("Counter: " + p.counter + " - for " + p.value);
p = p.children[index];
}
}
p.isEnd = true;
}
// Returns if the word is in the trie.
public boolean search(String word) {
TrieNode p = searchNode(word);
if (p == null) {
return false;
} else {
if (p.isEnd)
return true;
}
return false;
}
// Returns if there is any word in the trie
// that starts with the given prefix.
public boolean startsWith(String prefix) {
TrieNode p = searchNode(prefix);
if (p == null) {
return false;
} else {
return true;
}
}
public TrieNode searchNode(String s) {
TrieNode p = root;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
int index = c - 'a';
if (p.children[index] != null) {
p = p.children[index];
} else {
return null;
}
}
if (p == root)
return null;
return p;
}
}
遍历trie并找到长度为n的前缀的方法
(应该有一个额外的函数来查找长度>= n的前缀):
public static void findMostFrequentPrefix(TrieNode current, int n) {
for (TrieNode temp : current.children) {
if(temp != null) {
// logic to examine which counter value is the greatest
// and then (maybe?) follow the idea until the length of n of prefix is reached
System.out.println(temp.value);
findMostFrequentPrefix(temp, n);
}
}
}
更新:对第二种情况(>= n)的更清晰的解释:
假设我们有一组数据:
test
testi
testif
testify
如果我想找到长度为 n = 2
的最常见前缀,我可以调用第一个答案中提供的函数:findMostFrequentPrefix(t.getRoot(), 2);
- 它会 return te
.
然而在第二种情况下,我想找到长度可以大于或等于n(>= n)
的最频繁的前缀。使用上面提供的数据,如果我调用一个新函数 findMostFrequentPrefixGreaterEqual(t.getRoot(), 1);
那么最频繁的前缀可以是:
1. n = 1 - "t" - max frequency = 4
2. n = 2 - "te" - max frequency = 4
3. n = 3 - "tes" - max frequency = 4
4. n = 4 - "test - max frequency = 4
这意味着频率相同。该函数可以指示存在相同的频率,但 return 长度最大的前缀,在这种情况下它将是:test
.
首先,我认为您在跟踪每个 TrieNode
的使用情况的代码中存在一个小而重要的错误。如果您检查 cab
中表示 b
的 TrieNode 的 counter
的值,您会看到它是 1,而它应该是 2,因为:
cab
cabin
您需要通过将 insert
方法末尾的代码更新为:
来记录表示单词中最后一个字符的节点的使用情况
p.counter++;
p.isEnd = true;
通过该更新,您可以获得最常见的前缀(可能不止一个),如下所示:
public static int findMostFrequentPrefix(TrieNode current, int n, int max, char[] curr, int depth,
List<Prefix> result)
{
if (n == 0)
{
if (current.counter >= max)
{
if (current.counter > max)
{
result.clear();
max = current.counter;
}
result.add(new Prefix(max, String.valueOf(curr)));
}
return max;
}
for (TrieNode child : current.children)
{
if (child != null)
{
curr[depth] = child.value;
max = Math.max(max, findMostFrequentPrefix(child, n - 1, max, curr, depth + 1, result));
}
}
return max;
}
使用小型 PrefIx
助手 class:
static class Prefix
{
int freq;
String value;
public Prefix(int freq, String value)
{
this.freq = freq;
this.value = value;
}
}
测试:
int len = 3;
List<Prefix> result = new ArrayList<>();
int max = findMostFrequentPrefix(t.getRoot(), len, 0, new char[len], 0, result);
System.out.format("max frequency: %d%n", max);
for(Prefix p : result)
System.out.println(p.value);
输出:
max frequency: 2
cab
tes
这是实施您的第二种方法的尝试,findMostFrequentPrefixGreaterEqual
。您必须测试并确认它 returns 在所有情况下的预期结果:
public static int findMostFrequentPrefixGreaterEqual(TrieNode current, int n, int max, char[] curr, int depth,
List<Prefix> result)
{
if (n <= 0)
{
if (current.counter >= max)
{
result.clear();
max = current.counter;
result.add(new Prefix(max, String.valueOf(curr, 0, depth)));
}
else
{
return max;
}
}
for (TrieNode child : current.children)
{
if (child != null)
{
curr[depth] = child.value;
max = Math.max(max, findMostFrequentPrefixGreaterEqual(child, n - 1, max, curr, depth + 1, result));
}
}
return max;
}
测试:
t.insert("test");
t.insert("testi");
t.insert("testif");
t.insert("testify");
int len = 2;
int maxLen = 7;
List<Prefix> result = new ArrayList<>();
int max = findMostFrequentPrefixGreaterEqual(t.getRoot(), len, 0, new char[maxLen], 0, result);
System.out.format("max frequency: %d%n", max);
for(Prefix p : result)
System.out.println(p.value);
请注意,我们现在必须知道 Trie
中任何单词的最大长度,以便在 char
数组中分配足够的 space。
输出:
max frequency: 4
test
我正在使用 Trie 数据结构并试图找到:
- 长度为n的最频繁前缀
- 长度为n或更多的最常见前缀
有一个类似的 post 但没有提供代码。上述 post 中建议的可能解决方案,我无法正确实施:
为每个 trie 节点添加一个计数器,并在使用该节点时递增该值,然后扫描节点以检查最频繁的前缀在哪里。
我可以统计每个节点的使用情况,但我不知道如何检查哪个前缀是最常用的,然后如何从节点中构建这个前缀。问题:如何根据计数器找到长度为n(或> = n)的最频繁前缀以及如何打印这样的前缀?
我的 trie 有数千个单词,但出于演示目的,我们可以坚持使用一组单词:
Trie t = new Trie();
t.insert("test");
t.insert("testify");
t.insert("television");
t.insert("car");
t.insert("cat");
t.insert("cabin");
t.insert("cab");
findMostFrequentPrefix(t.getRoot(), 2);
// expected output for provided data set:
// 1) returns "ca" for exact length of 2 // "ca" appears 4 times and "te" appears 3 times
// 2) (not exactly sure about this) returns "ca" (appears 4 times, the second most frequent "cab" appears only 2 times) for length 2 or more, however if there was an additional most frequent prefix e.g. "cad" (5 times appearance), the function should return this one, instead of "ca"
Trie 代码,Trie 节点:
public class TrieNode {
TrieNode[] children;
boolean isEnd;
char value;
int counter;
public TrieNode() {
this.children = new TrieNode[26];
}
public TrieNode getNode(char ch) {
if (this.children.length == 0) {
return null;
}
for (TrieNode child : this.children) {
if (child != null)
if (child.value == ch)
return child;
}
return null;
}
public TrieNode[] getChildren() {
return children;
}
public char getValue() {
return value;
}
}
public class Trie {
private TrieNode root;
public Trie() {
root = new TrieNode();
}
public TrieNode getRoot() {
return this.root;
}
// Inserts a word into the trie.
public void insert(String word) {
TrieNode p = root;
for (int i = 0; i < word.length(); i++) {
char c = word.charAt(i);
int index = c - 'a';
// p.counter++ incrementing the value which later will be used to examine the frequency
if (p.children[index] == null) {
TrieNode temp = new TrieNode();
p.children[index] = temp;
p.children[index].value = c;
p.counter++;
System.out.println("Counter: " + p.counter + " - for " + p.value);
p = temp;
} else {
p.counter++;
System.out.println("Counter: " + p.counter + " - for " + p.value);
p = p.children[index];
}
}
p.isEnd = true;
}
// Returns if the word is in the trie.
public boolean search(String word) {
TrieNode p = searchNode(word);
if (p == null) {
return false;
} else {
if (p.isEnd)
return true;
}
return false;
}
// Returns if there is any word in the trie
// that starts with the given prefix.
public boolean startsWith(String prefix) {
TrieNode p = searchNode(prefix);
if (p == null) {
return false;
} else {
return true;
}
}
public TrieNode searchNode(String s) {
TrieNode p = root;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
int index = c - 'a';
if (p.children[index] != null) {
p = p.children[index];
} else {
return null;
}
}
if (p == root)
return null;
return p;
}
}
遍历trie并找到长度为n的前缀的方法 (应该有一个额外的函数来查找长度>= n的前缀):
public static void findMostFrequentPrefix(TrieNode current, int n) {
for (TrieNode temp : current.children) {
if(temp != null) {
// logic to examine which counter value is the greatest
// and then (maybe?) follow the idea until the length of n of prefix is reached
System.out.println(temp.value);
findMostFrequentPrefix(temp, n);
}
}
}
更新:对第二种情况(>= n)的更清晰的解释:
假设我们有一组数据:
test
testi
testif
testify
如果我想找到长度为 n = 2
的最常见前缀,我可以调用第一个答案中提供的函数:findMostFrequentPrefix(t.getRoot(), 2);
- 它会 return te
.
然而在第二种情况下,我想找到长度可以大于或等于n(>= n)
的最频繁的前缀。使用上面提供的数据,如果我调用一个新函数 findMostFrequentPrefixGreaterEqual(t.getRoot(), 1);
那么最频繁的前缀可以是:
1. n = 1 - "t" - max frequency = 4
2. n = 2 - "te" - max frequency = 4
3. n = 3 - "tes" - max frequency = 4
4. n = 4 - "test - max frequency = 4
这意味着频率相同。该函数可以指示存在相同的频率,但 return 长度最大的前缀,在这种情况下它将是:test
.
首先,我认为您在跟踪每个 TrieNode
的使用情况的代码中存在一个小而重要的错误。如果您检查 cab
中表示 b
的 TrieNode 的 counter
的值,您会看到它是 1,而它应该是 2,因为:
cab
cabin
您需要通过将 insert
方法末尾的代码更新为:
p.counter++;
p.isEnd = true;
通过该更新,您可以获得最常见的前缀(可能不止一个),如下所示:
public static int findMostFrequentPrefix(TrieNode current, int n, int max, char[] curr, int depth,
List<Prefix> result)
{
if (n == 0)
{
if (current.counter >= max)
{
if (current.counter > max)
{
result.clear();
max = current.counter;
}
result.add(new Prefix(max, String.valueOf(curr)));
}
return max;
}
for (TrieNode child : current.children)
{
if (child != null)
{
curr[depth] = child.value;
max = Math.max(max, findMostFrequentPrefix(child, n - 1, max, curr, depth + 1, result));
}
}
return max;
}
使用小型 PrefIx
助手 class:
static class Prefix
{
int freq;
String value;
public Prefix(int freq, String value)
{
this.freq = freq;
this.value = value;
}
}
测试:
int len = 3;
List<Prefix> result = new ArrayList<>();
int max = findMostFrequentPrefix(t.getRoot(), len, 0, new char[len], 0, result);
System.out.format("max frequency: %d%n", max);
for(Prefix p : result)
System.out.println(p.value);
输出:
max frequency: 2
cab
tes
这是实施您的第二种方法的尝试,findMostFrequentPrefixGreaterEqual
。您必须测试并确认它 returns 在所有情况下的预期结果:
public static int findMostFrequentPrefixGreaterEqual(TrieNode current, int n, int max, char[] curr, int depth,
List<Prefix> result)
{
if (n <= 0)
{
if (current.counter >= max)
{
result.clear();
max = current.counter;
result.add(new Prefix(max, String.valueOf(curr, 0, depth)));
}
else
{
return max;
}
}
for (TrieNode child : current.children)
{
if (child != null)
{
curr[depth] = child.value;
max = Math.max(max, findMostFrequentPrefixGreaterEqual(child, n - 1, max, curr, depth + 1, result));
}
}
return max;
}
测试:
t.insert("test");
t.insert("testi");
t.insert("testif");
t.insert("testify");
int len = 2;
int maxLen = 7;
List<Prefix> result = new ArrayList<>();
int max = findMostFrequentPrefixGreaterEqual(t.getRoot(), len, 0, new char[maxLen], 0, result);
System.out.format("max frequency: %d%n", max);
for(Prefix p : result)
System.out.println(p.value);
请注意,我们现在必须知道 Trie
中任何单词的最大长度,以便在 char
数组中分配足够的 space。
输出:
max frequency: 4
test