在字符串列表中查找最长公共 Prefix/Suffix 的出现次数?
Find occurrence count of the longest common Prefix/Suffix in a List of Strings?
给定一个字符串列表:
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
这应该以 HashMap<String, Integer>
prefixMap
和 suffixMap
:
的形式产生输出
前缀:
Mary had a -> 2
Mary had a little -> 2
( -> 2
Visit -> 2
后缀:
named Willy -> 2
ham -> 2
) -> 2
Store -> 2
到目前为止,我可以使用以下代码生成一个出现在列表中所有项目中的前缀:
public static final int INDEX_NOT_FOUND = -1;
public static String getAllCommonPrefixesInList(final String... strs) {
if (strs == null || strs.length == 0) {
return EMPTY_STRING;
}
final int smallestIndexOfDiff = getIndexOfDifference(strs);
if (smallestIndexOfDiff == INDEX_NOT_FOUND) {
// All Strings are identical
if (strs[0] == null) {
return EMPTY_STRING;
}
return strs[0];
} else if (smallestIndexOfDiff == 0) {
// No common initial characters found, return empty String
return EMPTY_STRING;
} else {
// Common initial character sequence found, return sequence
return strs[0].substring(0, smallestIndexOfDiff);
}
}
public static int getIndexOfDifference(final CharSequence... charSequence) {
if (charSequence == null || charSequence.length <= 1) {
return INDEX_NOT_FOUND;
}
boolean isAnyStringNull = false;
boolean areAllStringsNull = true;
final int arrayLen = charSequence.length;
int shortestStrLen = Integer.MAX_VALUE;
int longestStrLen = 0;
// Find the min and max string lengths - avoids having to check that we are not exceeding the length of the string each time through the bottom loop.
for (int i = 0; i < arrayLen; i++) {
if (charSequence[i] == null) {
isAnyStringNull = true;
shortestStrLen = 0;
} else {
areAllStringsNull = false;
shortestStrLen = Math.min(charSequence[i].length(), shortestStrLen);
longestStrLen = Math.max(charSequence[i].length(), longestStrLen);
}
}
// Deals with lists containing all nulls or all empty strings
if (areAllStringsNull || longestStrLen == 0 && !isAnyStringNull) {
return INDEX_NOT_FOUND;
}
// Handle lists containing some nulls or some empty strings
if (shortestStrLen == 0) {
return 0;
}
// Find the position with the first difference across all strings
int firstDiff = -1;
for (int stringPos = 0; stringPos < shortestStrLen; stringPos++) {
final char comparisonChar = charSequence[0].charAt(stringPos);
for (int arrayPos = 1; arrayPos < arrayLen; arrayPos++) {
if (charSequence[arrayPos].charAt(stringPos) != comparisonChar) {
firstDiff = stringPos;
break;
}
}
if (firstDiff != -1) {
break;
}
}
if (firstDiff == -1 && shortestStrLen != longestStrLen) {
// We compared all of the characters up to the length of the
// shortest string and didn't find a match, but the string lengths
// vary, so return the length of the shortest string.
return shortestStrLen;
}
return firstDiff;
}
但是,我的目标是包含任何 prefix/suffix,结果地图中至少出现 2+
次。
Java
如何实现?
使用 trie.
应该很容易解决这个问题
trie 节点基本上应该跟踪两件事:
- 子节点
- 以当前节点结束的前缀数
在 trie 中插入所有字符串,这将在 O(string length * number of strings)
中完成。之后,只需遍历 trie,您就可以根据您的用例根据计数对前缀进行哈希处理。对于后缀,您可以使用相同的方法,只需开始以相反的顺序遍历字符串即可。
编辑:
转念一想,trie 可能是最有效的方法,但一个简单的 hashmap 实现也应该在这里起作用。下面是生成所有计数 > 1 的前缀的示例。
import java.util.*;
import java.util.stream.*;
class Main {
public static void main(String[] args) {
System.out.println("Hello world!");
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Map<String, Integer> prefixMap = new HashMap<String, Integer>();
ArrayList<String> stringsWithHighestOccurrence = new ArrayList<String>();
for (String word : strList) {
for (int i = 1; i <= word.length(); i++){
String prefix = word.substring(0, i);
prefixMap.merge(prefix, 1, Integer::sum);
}
}
Integer maxval = Collections.max(prefixMap.values());
for (String key: prefixMap.keySet()){
Integer value = prefixMap.get(key);
if (value > 1) System.out.println(key + " : " + value);
if (value == maxval) stringsWithHighestOccurrence.add(key);
}
int maxLength = stringsWithHighestOccurrence.stream().map(String::length).max(Integer::compareTo).get();
System.out.println(maxLength);
ArrayList<String> prefixesWithMaxLength =
stringsWithHighestOccurrence.stream().filter(c -> c.length() == maxLength).collect(Collectors.toCollection(ArrayList::new));
System.out.println(prefixesWithMaxLength);
}
}
尽管如此,为了完成,我还将添加一个基本的 TrieNode
实现,因为我的回答首先提出了这种方法。
TrieNode
:
class TrieNode {
private final Map<Character, TrieNode> children = new HashMap<>();
private int count;
Map<Character, TrieNode> getChildren() {
return children;
}
boolean getCount() {
return count;
}
void increaseCount() {
this.count += 1;
}
}
Trie
:
class Trie {
private TrieNode root;
Trie() {
root = new TrieNode();
}
void insert(String word) {
TrieNode current = root;
for (char l : word.toCharArray()) {
current = current.getChildren().computeIfAbsent(l, c -> new TrieNode());
current.increaseCount()
}
}
}
遍历 trie 类似于一个简单的 DFS 场景,其中还维护了到当前节点的“路径”(我们正在切换带有前缀的路径到这一点)。
我认为@Abhinav 提供的解决方案应该可以使用 HashMap。在这里,我将 post 使用 java 中的简单 Trie 实现的解决方案(进行一些自定义,例如将 freq 添加到 Trie 节点中)。
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("( abc )");
strList.add("( xyz )");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
TNode root = new TNode("");
int maxFreq = 1;
for(String sentence : strList) {
TNode currentNode = root;
String[] words = sentence.split(" "); // Assuming space character is the delimiter
for(String word: words) {
if(currentNode.children.containsKey(word)) {
currentNode.children.get(word).freq += 1;
maxFreq = Math.max(maxFreq, currentNode.children.get(word).freq);
} else {
TNode c = new TNode(word);
c.freq = 1;
currentNode.children.put(word, c);
}
currentNode = currentNode.children.get(word);
}
}
Map<String, Integer> result = new HashMap<String, Integer>();
Queue<NodeWithPrefix> queue = new LinkedList<NodeWithPrefix>();
for(TNode node : root.children.values()){
NodeWithPrefix nwp = new NodeWithPrefix(node);
nwp.prefix = "";
queue.add(nwp);
}
while(!queue.isEmpty()) {
NodeWithPrefix item = queue.poll();
if(item.node.freq == maxFreq) {
result.put(item.prefix + " " + item.node.value, item.node.freq);
}
for(TNode child : item.node.children.values()) {
NodeWithPrefix nwp = new NodeWithPrefix(child);
nwp.prefix = item.prefix + " " + item.node.value;
queue.add(nwp);
}
}
return result;
此算法需要另外 2 个 classes:
class NodeWithPrefix {
String prefix;
TNode node;
public NodeWithPrefix(TNode node){
this.node = node;
}
}
class TNode {
String value;
int freq = 0;
Map<String, TNode> children;
public TNode(String value){
this.value = value;
children = new HashMap<String, TNode>();
}
}
输出是for prefix:(for postfix应该类似,只需要向后构建Trie)
{ Mary had=2, Mary had a=2, Visit=2, (=2, Mary had a little=2, Mary=2}
这里我使用 BFS 来检索 Trie 中频率等于 maxFreq 的所有子字符串。我们可以根据需要调整过滤条件。您也可以在这里进行 DFS。
其他考虑是我们可以将前缀添加到 TNode class 本身,我更喜欢将它单独放在另一个 class.
中
根据我对这个问题的理解,最适合解决它的数据结构是非循环不相交的 Graph.
在一般情况下,一个图将由几个不相连的簇组成。每个簇都会有一个tree-like结构,在边缘情况下它会形成一个链表.
基本上,解决这个问题的最简单朴素方法是根据每一行创建一堆链表 ,并遍历它们。缺点是:重复节点(更大的内存消耗),更大的time-complexity(需要更多的操作)并且更多error-prone 因为需要更多的手动操作。
图表的描述
所以我会坚持使用 graph 作为这个问题的数据结构,并尽量让事情尽可能简单。
让我们考虑以下输入:
"Mary had a little lamb named Willy"
"Mary had a little ham"
"A B C"
图形的图形表示将如下所示;
前两行将构成一个簇,由链表(头部)和组成tree(尾部)。第二个簇将由一个链表表示,它的顶点不与由其他字符串形成的顶点相连。
这不是构造顶点的唯一方式,头部可以产生一个 N-tree 并且可以在中间某处观察到链表。
要点是,为了解决问题,我们需要通过所有分支跟踪顶点链,直到 顶点重叠 。在 graph 的这些部分中,两条或多条线之间共有的每个 prefix-strings 和 suffix-string 将由 单个顶点表示 (节点).
为了维护映射到特定 顶点 的字符串的数量,每个顶点都应该有一个变量(int groupCount
在下面的代码中),当创建 顶点 时,它被分配默认值 1
并且每次新字符串映射到该顶点时递增。
每个 顶点 都包含一张地图,其中包含对其 邻居 的引用。添加新的 neighbour-vertex 时,根据给定的字符串创建新的 Vertex
或 count现有的 顶点 递增。
为了符合这个任务,图表应该保持对所有head-vertexes和[=89=的引用]tail-vertexes。为了简单起见,不是维护两组对相邻节点的引用,而是在每个顶点中维护两个单独的计数变量(因为suffix-count和prefix-count会不同),这个解决方案图实际上由两个图组成(suffix-graph和prefix-graph)。出于这个原因,实现 class 命名为 MultiGraph
.
为了用顶点[=填充suffix-graph和prefix-graph 153=],方法 addCluster()
通过 Iterator
以正常或相反的顺序遍历给定行的字符串,具体取决于正在填充哪个 graph .
深度优先搜索
图形填充后的下一步是生成频率为 2
和更大的字符串映射。
为此,正在使用 classical depth first search 算法。
为了实现 DFS,需要一个将用作 堆栈 的可变容器(ArrayDeque
正用于该目的)。从 heads/tails 的 map 中取出的第一个元素将被放置在 stack 的顶部,并且 StringBuilder
持有这个元素的名称将被放置在地图中。
然后,要恢复具有特定 count 的字符串,将从 的顶部弹出 vertexes stack 和它们的 neighbors 和 count > 1
依次将被放置在堆栈的顶部。 current prefix 的副本加上分隔符和邻居的名字将被映射到 neighbour-vertex.
如果计数发生变化,则表示当前前缀表示至少两行之间的最长公共字符串。在这种情况下,前缀和计数被添加到生成的映射中。
实施
下面的实现由两个class组成,分别是narrow-focused和self-contained . MultiGraph
class 专门用作数据结构,维护两个图。 pluming代码,像拆分字符串的行一样,被提取到一个单独的class GraphManager
.
图表
public class MultiGraph {
private final Map<String, Vertex> heads = new HashMap<>();
private final Map<String, Vertex> tails = new HashMap<>();
public void addCluster(Deque<String> names) {
addCluster(heads, names.iterator());
addCluster(tails, names.descendingIterator());
}
private void addCluster(Map<String, Vertex> clusters, Iterator<String> names) {
String rootName = names.next();
if (clusters.containsKey(rootName)) {
clusters.get(rootName).incrementGroupCount();
} else {
clusters.put(rootName, new Vertex(rootName));
}
Vertex current = clusters.get(rootName);
while (names.hasNext()) {
current = current.addNext(names.next());
}
}
public Map<String, Integer> generatePrefixMap(String delimiter) {
Map<String, Integer> countByPrefix = new HashMap<>();
for (Vertex next: heads.values()) {
if (next.getGroupCount() == 1) {
continue;
}
performDFS(heads, countByPrefix, delimiter, next);
}
return countByPrefix;
}
public Map<String, Integer> generateSuffixMap(String delimiter) {
Map<String, Integer> countBySuffix = new HashMap<>();
for (Vertex next: tails.values()) {
if (next.getGroupCount() == 1) {
continue;
}
performDFS(tails, countBySuffix, delimiter, next);
}
return countBySuffix;
}
// implementation of the Depth First Search algorithm
public void performDFS(Map<String, Vertex> clusters,
Map<String, Integer> countByPrefix,
String delimiter, Vertex next) {
StringBuilder prefix = null;
Vertex current = next;
int count = next.getGroupCount();
Deque<Vertex> stack = new ArrayDeque<>(); // create as stack
Map<Vertex, StringBuilder> prefixByVert = new HashMap<>();
stack.push(next); // place the first element on the stack
prefixByVert.put(current, new StringBuilder(current.getName()));
while (!stack.isEmpty()) {
current = stack.pop();
if (current.getGroupCount() < count) { // the number of strings mapped to the current Vertex has been changed
countByPrefix.put(prefix.toString(), count); // saving the result
count = current.getGroupCount();
}
prefix = prefixByVert.get(current);
for (Vertex neighbour: current.getNextVertByVal().values()) {
if (next.getGroupCount() == 1) {
continue;
}
stack.push(neighbour);
prefixByVert.put(neighbour, new StringBuilder(prefix)
.append(delimiter)
.append(neighbour.getName()));
}
}
if (prefix != null && count > 1) {
countByPrefix.putIfAbsent(prefix.toString(), count);
}
}
private static class Vertex {
private final String name;
private int groupCount = 1;
private final Map<String, Vertex> nextVertByVal = new HashMap<>();
public Vertex(String name) {
this.name = name;
}
public Vertex addNext(String value) {
if (nextVertByVal.containsKey(value)) {
nextVertByVal.get(value).incrementGroupCount();
} else {
nextVertByVal.put(value, new Vertex(value));
}
return nextVertByVal.get(value);
}
public void incrementGroupCount() {
this.groupCount++;
}
public String getName() {
return name;
}
public int getGroupCount() {
return groupCount;
}
public Map<String, Vertex> getNextVertByVal() {
return nextVertByVal;
}
}
}
以下 class 处理输入数据的处理任务:它拆分行,负责丢弃可能发生的空字符串,并将输入打包到 Deque
以方便的方式适应双向迭代。
它还会实例化 graph 并管理它的工作。 GraphManager
负责向图形提供 分隔符 以便在创建结果映射时恢复字符串的初始形状。有了它,您可以在白色 space 上拆分给定的行,通过空字符串逐个字符或标点符号处理行,而无需更改这两个 classes.
中的一行
图形管理器
public class GraphManager {
private MultiGraph graph = new MultiGraph();
private String delimiter;
private GraphManager(String delimiter) {
this.delimiter = delimiter;
}
public static GraphManager getInstance(Iterable<String> lines, String delimiter) {
GraphManager gm = new GraphManager(delimiter);
gm.init(lines);
return gm;
}
private void init(Iterable<String> lines) {
for (String line: lines) {
Deque<String> names = new ArrayDeque<>();
for (String name: line.split(delimiter)) {
if (!name.isEmpty()) {
names.add(name);
}
}
addCluster(names);
}
}
private void addCluster(Deque<String> names) {
graph.addCluster(names);
}
public Map<String, Integer> getPrefixMap() {
return graph.generatePrefixMap(delimiter);
}
public Map<String, Integer> getSuffixMap() {
return graph.generateSuffixMap(delimiter);
}
}
main()
public static void main(String[] args) {
List<String> lines = List.of(
"Mary had a little lamb named Willy", "Mary had a little ham",
"Old McDonald had a farm named Willy", "Willy had a little dog named ham",
"( abc )", "( xyz )", "Visit Target Store", "Visit Walmart Store");
GraphManager gm = GraphManager.getInstance(lines, " ");
System.out.println("Prefixes:");
for (Map.Entry<String, Integer> entry: gm.getPrefixMap().entrySet()) {
System.out.println(entry.getValue() + " " + entry.getKey());
}
System.out.println("\nSuffixes:");
for (Map.Entry<String, Integer> entry: gm.getSuffixMap().entrySet()) {
System.out.println(entry.getValue() + " " + entry.getKey());
}
}
输出
Prefixes:
2 Mary had a little
2 Visit
2 (
Suffixes:
2 ham
2 )
2 Store
2 Willy named
要实现一个trie,首先你需要一个节点。
class Node<T> {
private final T value;
private final Node<T> parent;
private final Map<T, Node<T>> children;
private boolean isEnd;
Node(T value, Node<T> parent) {
this.value = value;
this.parent = parent;
this.children = new HashMap<>();
this.isEnd = false;
}
Node<T> addChild(T childValue, Node<T> parent) {
//return child node if existing, otherwise create and return
return this.children.computeIfAbsent(childValue, value -> new Node<>(value, parent));
}
T getValue() {
return this.value;
}
Node<T> getParent() {
return this.parent;
}
boolean isEnd() {
return this.isEnd;
}
void setEnd(boolean isEnd) {
this.isEnd = isEnd;
}
Collection<Node<T>> children() {
return this.children.values();
}
@Override
public String toString() {
//for easier debugging
return "Node{" +
"value=" + this.value +
", children=" + this.children.keySet() +
", isEnd=" + this.isEnd +
'}';
}
}
在节点中我们有实际值,对父节点的引用,以便于构建前缀,子节点将值映射到相应的节点,如果节点是结束节点,则有一个标志。
实际Trie
实施
public class Trie<T> {
private final Node<T> root;
public Trie() {
this.root = new Node<>(null, null);
}
public void insert(T[] elements) {
if (elements.length == 0) {
//don't want to set root as end node
return;
}
Node<T> currentNode = this.root;
for (T element : elements) {
currentNode = currentNode.addChild(element, currentNode);
}
currentNode.setEnd(true);
}
public Map<Collection<T>, Integer> countPrefixes(BiConsumer<Deque<T>, T> operation) {
Map<Collection<T>, Integer> map = new LinkedHashMap<>();
this.countPrefixes(this.root, map, operation);
return map;
}
private void countPrefixes(Node<T> current, Map<Collection<T>, Integer> map, BiConsumer<Deque<T>, T> operation) {
if (current != this.root) {
//check java doc for AbstractSet hashCode and equals
ArrayList<T> prefix = this.buildKey(current, operation);
int childrenCount = current.children().size();
if (childrenCount == 0 && current.isEnd()) {
//this sets entire collection(entire sentence 'Mary had a little lamb named Willy' for example)
//as prefix which is met once
childrenCount = 1;
}
map.merge(prefix, childrenCount, Integer::sum);
if (childrenCount > 1) {
//each parent node is already marked as prefix met once,
//but any node having more than one child means entire chain of nodes
//is a prefix met the number of children,
//so we go backwards to update parent counts with the difference
this.updateParentPrefixes(current.getParent(), childrenCount - 1, map, operation);
}
}
for (Node<T> child : current.children()) {
//visit each child recursively to count them
//depth first
this.countPrefixes(child, map, operation);
}
}
//operation is abstraction for the order
//in which we want to add elements
//when building key/prefix
private ArrayList<T> buildKey(Node<T> node, BiConsumer<Deque<T>, T> operation) {
Deque<T> prefix = new LinkedList<>();
while (node.getValue() != null) {
operation.accept(prefix, node.getValue());
node = node.getParent();
}
return new ArrayList<>(prefix);
}
private void updateParentPrefixes(Node<T> parent, int updateCount, Map<Collection<T>, Integer> map, BiConsumer<Deque<T>, T> operation) {
if (parent == this.root) {
//we don't want to update root, ever!!!
return;
}
ArrayList<T> prefix = this.buildKey(parent, operation);
map.merge(prefix, updateCount, Integer::sum);
//visit parents recursively until root
this.updateParentPrefixes(parent.getParent(), updateCount, map, operation);
}
}
在某种意义上简化了一些,只实现了insert
,但我们现在不需要update和delete。需要注意的几点:
- Trie 必须有一个 root 节点,它是空的。我们用
null
值和 null
父级表示。我们必须永不更新此节点。
countPrefixes()
的参数 BiConsumer<Deque<T>, T> operation
。这是我们在创建前缀时想要添加元素的顺序的抽象。它是必需的,因为在反转 collection/sentence. 时,计数后缀可以表示并实现为计数前缀
- Return 输入
Map<Collection<T>, Integer>
,共 countPrefixes()
。此实现更通用,因此每个前缀都表示为 节点值集合 .
- 为什么前缀是
ArrayList
?首先我们需要保持插入顺序。其次,它对 hashCode()
和 equals()
的实现使其成为映射键的良好候选者。引用 javadoc:
This ensures that list1.equals(list2) implies that list1.hashCode()==list2.hashCode() for any two lists, list1 and list2, as required by the general contract of Object.hashCode.
- 对于 hashCode。
In other words, two lists are defined to be equal if they contain the same elements in the same order.
- 等于。
- 代码中的注释应解释实施中的其余细节。
要测试的主线。
public class TrieMain {
public static void main(String[] args) {
String spacePattern = "\s+";
String emptyStringPattern = "";
ArrayList<String> strList = new ArrayList<>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Mary had a sandwich");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("Willy had a big dog named Willy");
strList.add("Willy had a huge dog named Willy");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Trie<String> trie = new Trie<>();
//using another trie for suffix to avoid counting suffix as prefix and vice versa
Trie<String> reversedTrie = new Trie<>();
for (String string : strList) {
String pattern = string.startsWith("(") ? emptyStringPattern : spacePattern;
String[] words = string.split(pattern);
trie.insert(words);
//reverse collection to count suffixes
Collections.reverse(Arrays.asList(words));
reversedTrie.insert(words);
}
Map<Collection<String>, Integer> prefixCount = trie.countPrefixes(Deque::addFirst);
System.out.println("Prefixes:");
printMap(prefixCount);
System.out.println();
Map<Collection<String>, Integer> suffixCount = reversedTrie.countPrefixes(Deque::addLast);
System.out.println("Suffixes:");
printMap(suffixCount);
}
private static void printMap(Map<Collection<String>, Integer> map) {
map.entrySet()
.stream()
.filter(entry -> entry.getValue() > 1)
.forEach(e -> {
String key = String.join(" ", e.getKey());
String line = String.format("%s -> %d", key, e.getValue());
System.out.println(line);
});
}
}
这里需要注意的事项:
- 在列表中添加了更多句子以进行额外测试。
- 我将
(
的输入拆分为字符而不是单词作为其余部分以实现您的输出,但示例有些混乱。对于句子前缀是单词的集合,而对于这两个前缀是字符。
- 后缀需要第二个
trie
- 否则我们可能会将前缀算作后缀,反之亦然。
Arrays.asList()
返回的列表由输入数组支持。我正在利用它来反转输入数组本身的后缀。
编辑: 虽然使用实现的字符进行测试 AbstractSet,或者准确地说是任何集合,对于前缀来说是个坏主意,因为它不允许重复的元素。更新了 trie 实现以使用 ArrayList
。我还添加了 2 个主要方法,以演示 trie 使用字符串和字符:
- Strings - 每个节点是一个词,通过space
拆分字符串得到
public class TrieWords {
public static void main(String[] args) {
ArrayList<String> strList = new ArrayList<>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Mary had a sandwich");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("Willy had a big dog named Willy");
strList.add("Willy had a huge dog named Willy");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Trie<String> trie = new Trie<>();
//using another trie for suffix to avoid counting suffix as prefix and vice versa
Trie<String> reversedTrie = new Trie<>();
for (String string : strList) {
String[] words = string.split("\s+");
trie.insert(words);
//reverse collection to count suffixes
Collections.reverse(Arrays.asList(words));
reversedTrie.insert(words);
}
Map<Collection<String>, Integer> prefixCount = trie.countPrefixes(Deque::addFirst);
System.out.println("Prefixes:");
printMap(prefixCount);
System.out.println();
Map<Collection<String>, Integer> suffixCount = reversedTrie.countPrefixes(Deque::addLast);
System.out.println("Suffixes:");
printMap(suffixCount);
}
private static void printMap(Map<Collection<String>, Integer> map) {
map.entrySet()
.stream()
.filter(entry -> entry.getValue() > 1)
.forEach(e -> {
String key = String.join(" ", e.getKey());
String line = String.format("%s -> %d", key, e.getValue());
System.out.println(line);
});
}
}
- 字符 - 每个节点都是字符串
中的一个字符
public class TrieChars {
public static void main(String[] args) {
ArrayList<String> strList = new ArrayList<>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Mary had a sandwich");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("Willy had a big dog named Willy");
strList.add("Willy had a huge dog named Willy");
strList.add("(abcd)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Trie<Character> trie = new Trie<>();
//using another trie for suffix to avoid counting suffix as prefix and vice versa
Trie<Character> reversedTrie = new Trie<>();
for (String string : strList) {
int length = string.length();
Character[] words = new Character[length];
Character[] reversed = new Character[length];
for (int i = 0; i < length; i++) {
int reversedIndex = length - 1 - i;
words[i] = string.charAt(i);
reversed[reversedIndex] = string.charAt(i);
}
trie.insert(words);
reversedTrie.insert(reversed);
}
Map<Collection<Character>, Integer> prefixCount = trie.countPrefixes(Deque::addFirst);
System.out.println("Prefixes:");
printMap(prefixCount);
System.out.println();
Map<Collection<Character>, Integer> suffixCount = reversedTrie.countPrefixes(Deque::addLast);
System.out.println("Suffixes:");
printMap(suffixCount);
}
private static void printMap(Map<Collection<Character>, Integer> map) {
map.entrySet()
.stream()
.filter(entry -> entry.getValue() > 1)
.forEach(e -> {
String key = e.getKey().stream().map(Object::toString).collect(Collectors.joining(""));
String line = String.format("%s -> %d", key, e.getValue());
System.out.println(line);
});
}
}
给定一个字符串列表:
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
这应该以 HashMap<String, Integer>
prefixMap
和 suffixMap
:
前缀:
Mary had a -> 2
Mary had a little -> 2
( -> 2
Visit -> 2
后缀:
named Willy -> 2
ham -> 2
) -> 2
Store -> 2
到目前为止,我可以使用以下代码生成一个出现在列表中所有项目中的前缀:
public static final int INDEX_NOT_FOUND = -1;
public static String getAllCommonPrefixesInList(final String... strs) {
if (strs == null || strs.length == 0) {
return EMPTY_STRING;
}
final int smallestIndexOfDiff = getIndexOfDifference(strs);
if (smallestIndexOfDiff == INDEX_NOT_FOUND) {
// All Strings are identical
if (strs[0] == null) {
return EMPTY_STRING;
}
return strs[0];
} else if (smallestIndexOfDiff == 0) {
// No common initial characters found, return empty String
return EMPTY_STRING;
} else {
// Common initial character sequence found, return sequence
return strs[0].substring(0, smallestIndexOfDiff);
}
}
public static int getIndexOfDifference(final CharSequence... charSequence) {
if (charSequence == null || charSequence.length <= 1) {
return INDEX_NOT_FOUND;
}
boolean isAnyStringNull = false;
boolean areAllStringsNull = true;
final int arrayLen = charSequence.length;
int shortestStrLen = Integer.MAX_VALUE;
int longestStrLen = 0;
// Find the min and max string lengths - avoids having to check that we are not exceeding the length of the string each time through the bottom loop.
for (int i = 0; i < arrayLen; i++) {
if (charSequence[i] == null) {
isAnyStringNull = true;
shortestStrLen = 0;
} else {
areAllStringsNull = false;
shortestStrLen = Math.min(charSequence[i].length(), shortestStrLen);
longestStrLen = Math.max(charSequence[i].length(), longestStrLen);
}
}
// Deals with lists containing all nulls or all empty strings
if (areAllStringsNull || longestStrLen == 0 && !isAnyStringNull) {
return INDEX_NOT_FOUND;
}
// Handle lists containing some nulls or some empty strings
if (shortestStrLen == 0) {
return 0;
}
// Find the position with the first difference across all strings
int firstDiff = -1;
for (int stringPos = 0; stringPos < shortestStrLen; stringPos++) {
final char comparisonChar = charSequence[0].charAt(stringPos);
for (int arrayPos = 1; arrayPos < arrayLen; arrayPos++) {
if (charSequence[arrayPos].charAt(stringPos) != comparisonChar) {
firstDiff = stringPos;
break;
}
}
if (firstDiff != -1) {
break;
}
}
if (firstDiff == -1 && shortestStrLen != longestStrLen) {
// We compared all of the characters up to the length of the
// shortest string and didn't find a match, but the string lengths
// vary, so return the length of the shortest string.
return shortestStrLen;
}
return firstDiff;
}
但是,我的目标是包含任何 prefix/suffix,结果地图中至少出现 2+
次。
Java
如何实现?
使用 trie.
应该很容易解决这个问题trie 节点基本上应该跟踪两件事:
- 子节点
- 以当前节点结束的前缀数
在 trie 中插入所有字符串,这将在 O(string length * number of strings)
中完成。之后,只需遍历 trie,您就可以根据您的用例根据计数对前缀进行哈希处理。对于后缀,您可以使用相同的方法,只需开始以相反的顺序遍历字符串即可。
编辑:
转念一想,trie 可能是最有效的方法,但一个简单的 hashmap 实现也应该在这里起作用。下面是生成所有计数 > 1 的前缀的示例。
import java.util.*;
import java.util.stream.*;
class Main {
public static void main(String[] args) {
System.out.println("Hello world!");
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Map<String, Integer> prefixMap = new HashMap<String, Integer>();
ArrayList<String> stringsWithHighestOccurrence = new ArrayList<String>();
for (String word : strList) {
for (int i = 1; i <= word.length(); i++){
String prefix = word.substring(0, i);
prefixMap.merge(prefix, 1, Integer::sum);
}
}
Integer maxval = Collections.max(prefixMap.values());
for (String key: prefixMap.keySet()){
Integer value = prefixMap.get(key);
if (value > 1) System.out.println(key + " : " + value);
if (value == maxval) stringsWithHighestOccurrence.add(key);
}
int maxLength = stringsWithHighestOccurrence.stream().map(String::length).max(Integer::compareTo).get();
System.out.println(maxLength);
ArrayList<String> prefixesWithMaxLength =
stringsWithHighestOccurrence.stream().filter(c -> c.length() == maxLength).collect(Collectors.toCollection(ArrayList::new));
System.out.println(prefixesWithMaxLength);
}
}
尽管如此,为了完成,我还将添加一个基本的 TrieNode
实现,因为我的回答首先提出了这种方法。
TrieNode
:
class TrieNode {
private final Map<Character, TrieNode> children = new HashMap<>();
private int count;
Map<Character, TrieNode> getChildren() {
return children;
}
boolean getCount() {
return count;
}
void increaseCount() {
this.count += 1;
}
}
Trie
:
class Trie {
private TrieNode root;
Trie() {
root = new TrieNode();
}
void insert(String word) {
TrieNode current = root;
for (char l : word.toCharArray()) {
current = current.getChildren().computeIfAbsent(l, c -> new TrieNode());
current.increaseCount()
}
}
}
遍历 trie 类似于一个简单的 DFS 场景,其中还维护了到当前节点的“路径”(我们正在切换带有前缀的路径到这一点)。
我认为@Abhinav 提供的解决方案应该可以使用 HashMap。在这里,我将 post 使用 java 中的简单 Trie 实现的解决方案(进行一些自定义,例如将 freq 添加到 Trie 节点中)。
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("( abc )");
strList.add("( xyz )");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
TNode root = new TNode("");
int maxFreq = 1;
for(String sentence : strList) {
TNode currentNode = root;
String[] words = sentence.split(" "); // Assuming space character is the delimiter
for(String word: words) {
if(currentNode.children.containsKey(word)) {
currentNode.children.get(word).freq += 1;
maxFreq = Math.max(maxFreq, currentNode.children.get(word).freq);
} else {
TNode c = new TNode(word);
c.freq = 1;
currentNode.children.put(word, c);
}
currentNode = currentNode.children.get(word);
}
}
Map<String, Integer> result = new HashMap<String, Integer>();
Queue<NodeWithPrefix> queue = new LinkedList<NodeWithPrefix>();
for(TNode node : root.children.values()){
NodeWithPrefix nwp = new NodeWithPrefix(node);
nwp.prefix = "";
queue.add(nwp);
}
while(!queue.isEmpty()) {
NodeWithPrefix item = queue.poll();
if(item.node.freq == maxFreq) {
result.put(item.prefix + " " + item.node.value, item.node.freq);
}
for(TNode child : item.node.children.values()) {
NodeWithPrefix nwp = new NodeWithPrefix(child);
nwp.prefix = item.prefix + " " + item.node.value;
queue.add(nwp);
}
}
return result;
此算法需要另外 2 个 classes:
class NodeWithPrefix {
String prefix;
TNode node;
public NodeWithPrefix(TNode node){
this.node = node;
}
}
class TNode {
String value;
int freq = 0;
Map<String, TNode> children;
public TNode(String value){
this.value = value;
children = new HashMap<String, TNode>();
}
}
输出是for prefix:(for postfix应该类似,只需要向后构建Trie)
{ Mary had=2, Mary had a=2, Visit=2, (=2, Mary had a little=2, Mary=2}
这里我使用 BFS 来检索 Trie 中频率等于 maxFreq 的所有子字符串。我们可以根据需要调整过滤条件。您也可以在这里进行 DFS。 其他考虑是我们可以将前缀添加到 TNode class 本身,我更喜欢将它单独放在另一个 class.
中根据我对这个问题的理解,最适合解决它的数据结构是非循环不相交的 Graph.
在一般情况下,一个图将由几个不相连的簇组成。每个簇都会有一个tree-like结构,在边缘情况下它会形成一个链表.
基本上,解决这个问题的最简单朴素方法是根据每一行创建一堆链表 ,并遍历它们。缺点是:重复节点(更大的内存消耗),更大的time-complexity(需要更多的操作)并且更多error-prone 因为需要更多的手动操作。
图表的描述
所以我会坚持使用 graph 作为这个问题的数据结构,并尽量让事情尽可能简单。
让我们考虑以下输入:
"Mary had a little lamb named Willy"
"Mary had a little ham"
"A B C"
图形的图形表示将如下所示;
前两行将构成一个簇,由链表(头部)和组成tree(尾部)。第二个簇将由一个链表表示,它的顶点不与由其他字符串形成的顶点相连。
这不是构造顶点的唯一方式,头部可以产生一个 N-tree 并且可以在中间某处观察到链表。
要点是,为了解决问题,我们需要通过所有分支跟踪顶点链,直到 顶点重叠 。在 graph 的这些部分中,两条或多条线之间共有的每个 prefix-strings 和 suffix-string 将由 单个顶点表示 (节点).
为了维护映射到特定 顶点 的字符串的数量,每个顶点都应该有一个变量(int groupCount
在下面的代码中),当创建 顶点 时,它被分配默认值 1
并且每次新字符串映射到该顶点时递增。
每个 顶点 都包含一张地图,其中包含对其 邻居 的引用。添加新的 neighbour-vertex 时,根据给定的字符串创建新的 Vertex
或 count现有的 顶点 递增。
为了符合这个任务,图表应该保持对所有head-vertexes和[=89=的引用]tail-vertexes。为了简单起见,不是维护两组对相邻节点的引用,而是在每个顶点中维护两个单独的计数变量(因为suffix-count和prefix-count会不同),这个解决方案图实际上由两个图组成(suffix-graph和prefix-graph)。出于这个原因,实现 class 命名为 MultiGraph
.
为了用顶点[=填充suffix-graph和prefix-graph 153=],方法 addCluster()
通过 Iterator
以正常或相反的顺序遍历给定行的字符串,具体取决于正在填充哪个 graph .
深度优先搜索
图形填充后的下一步是生成频率为 2
和更大的字符串映射。
为此,正在使用 classical depth first search 算法。
为了实现 DFS,需要一个将用作 堆栈 的可变容器(ArrayDeque
正用于该目的)。从 heads/tails 的 map 中取出的第一个元素将被放置在 stack 的顶部,并且 StringBuilder
持有这个元素的名称将被放置在地图中。
然后,要恢复具有特定 count 的字符串,将从 的顶部弹出 vertexes stack 和它们的 neighbors 和 count > 1
依次将被放置在堆栈的顶部。 current prefix 的副本加上分隔符和邻居的名字将被映射到 neighbour-vertex.
如果计数发生变化,则表示当前前缀表示至少两行之间的最长公共字符串。在这种情况下,前缀和计数被添加到生成的映射中。
实施
下面的实现由两个class组成,分别是narrow-focused和self-contained . MultiGraph
class 专门用作数据结构,维护两个图。 pluming代码,像拆分字符串的行一样,被提取到一个单独的class GraphManager
.
图表
public class MultiGraph {
private final Map<String, Vertex> heads = new HashMap<>();
private final Map<String, Vertex> tails = new HashMap<>();
public void addCluster(Deque<String> names) {
addCluster(heads, names.iterator());
addCluster(tails, names.descendingIterator());
}
private void addCluster(Map<String, Vertex> clusters, Iterator<String> names) {
String rootName = names.next();
if (clusters.containsKey(rootName)) {
clusters.get(rootName).incrementGroupCount();
} else {
clusters.put(rootName, new Vertex(rootName));
}
Vertex current = clusters.get(rootName);
while (names.hasNext()) {
current = current.addNext(names.next());
}
}
public Map<String, Integer> generatePrefixMap(String delimiter) {
Map<String, Integer> countByPrefix = new HashMap<>();
for (Vertex next: heads.values()) {
if (next.getGroupCount() == 1) {
continue;
}
performDFS(heads, countByPrefix, delimiter, next);
}
return countByPrefix;
}
public Map<String, Integer> generateSuffixMap(String delimiter) {
Map<String, Integer> countBySuffix = new HashMap<>();
for (Vertex next: tails.values()) {
if (next.getGroupCount() == 1) {
continue;
}
performDFS(tails, countBySuffix, delimiter, next);
}
return countBySuffix;
}
// implementation of the Depth First Search algorithm
public void performDFS(Map<String, Vertex> clusters,
Map<String, Integer> countByPrefix,
String delimiter, Vertex next) {
StringBuilder prefix = null;
Vertex current = next;
int count = next.getGroupCount();
Deque<Vertex> stack = new ArrayDeque<>(); // create as stack
Map<Vertex, StringBuilder> prefixByVert = new HashMap<>();
stack.push(next); // place the first element on the stack
prefixByVert.put(current, new StringBuilder(current.getName()));
while (!stack.isEmpty()) {
current = stack.pop();
if (current.getGroupCount() < count) { // the number of strings mapped to the current Vertex has been changed
countByPrefix.put(prefix.toString(), count); // saving the result
count = current.getGroupCount();
}
prefix = prefixByVert.get(current);
for (Vertex neighbour: current.getNextVertByVal().values()) {
if (next.getGroupCount() == 1) {
continue;
}
stack.push(neighbour);
prefixByVert.put(neighbour, new StringBuilder(prefix)
.append(delimiter)
.append(neighbour.getName()));
}
}
if (prefix != null && count > 1) {
countByPrefix.putIfAbsent(prefix.toString(), count);
}
}
private static class Vertex {
private final String name;
private int groupCount = 1;
private final Map<String, Vertex> nextVertByVal = new HashMap<>();
public Vertex(String name) {
this.name = name;
}
public Vertex addNext(String value) {
if (nextVertByVal.containsKey(value)) {
nextVertByVal.get(value).incrementGroupCount();
} else {
nextVertByVal.put(value, new Vertex(value));
}
return nextVertByVal.get(value);
}
public void incrementGroupCount() {
this.groupCount++;
}
public String getName() {
return name;
}
public int getGroupCount() {
return groupCount;
}
public Map<String, Vertex> getNextVertByVal() {
return nextVertByVal;
}
}
}
以下 class 处理输入数据的处理任务:它拆分行,负责丢弃可能发生的空字符串,并将输入打包到 Deque
以方便的方式适应双向迭代。
它还会实例化 graph 并管理它的工作。 GraphManager
负责向图形提供 分隔符 以便在创建结果映射时恢复字符串的初始形状。有了它,您可以在白色 space 上拆分给定的行,通过空字符串逐个字符或标点符号处理行,而无需更改这两个 classes.
图形管理器
public class GraphManager {
private MultiGraph graph = new MultiGraph();
private String delimiter;
private GraphManager(String delimiter) {
this.delimiter = delimiter;
}
public static GraphManager getInstance(Iterable<String> lines, String delimiter) {
GraphManager gm = new GraphManager(delimiter);
gm.init(lines);
return gm;
}
private void init(Iterable<String> lines) {
for (String line: lines) {
Deque<String> names = new ArrayDeque<>();
for (String name: line.split(delimiter)) {
if (!name.isEmpty()) {
names.add(name);
}
}
addCluster(names);
}
}
private void addCluster(Deque<String> names) {
graph.addCluster(names);
}
public Map<String, Integer> getPrefixMap() {
return graph.generatePrefixMap(delimiter);
}
public Map<String, Integer> getSuffixMap() {
return graph.generateSuffixMap(delimiter);
}
}
main()
public static void main(String[] args) {
List<String> lines = List.of(
"Mary had a little lamb named Willy", "Mary had a little ham",
"Old McDonald had a farm named Willy", "Willy had a little dog named ham",
"( abc )", "( xyz )", "Visit Target Store", "Visit Walmart Store");
GraphManager gm = GraphManager.getInstance(lines, " ");
System.out.println("Prefixes:");
for (Map.Entry<String, Integer> entry: gm.getPrefixMap().entrySet()) {
System.out.println(entry.getValue() + " " + entry.getKey());
}
System.out.println("\nSuffixes:");
for (Map.Entry<String, Integer> entry: gm.getSuffixMap().entrySet()) {
System.out.println(entry.getValue() + " " + entry.getKey());
}
}
输出
Prefixes:
2 Mary had a little
2 Visit
2 (
Suffixes:
2 ham
2 )
2 Store
2 Willy named
要实现一个trie,首先你需要一个节点。
class Node<T> {
private final T value;
private final Node<T> parent;
private final Map<T, Node<T>> children;
private boolean isEnd;
Node(T value, Node<T> parent) {
this.value = value;
this.parent = parent;
this.children = new HashMap<>();
this.isEnd = false;
}
Node<T> addChild(T childValue, Node<T> parent) {
//return child node if existing, otherwise create and return
return this.children.computeIfAbsent(childValue, value -> new Node<>(value, parent));
}
T getValue() {
return this.value;
}
Node<T> getParent() {
return this.parent;
}
boolean isEnd() {
return this.isEnd;
}
void setEnd(boolean isEnd) {
this.isEnd = isEnd;
}
Collection<Node<T>> children() {
return this.children.values();
}
@Override
public String toString() {
//for easier debugging
return "Node{" +
"value=" + this.value +
", children=" + this.children.keySet() +
", isEnd=" + this.isEnd +
'}';
}
}
在节点中我们有实际值,对父节点的引用,以便于构建前缀,子节点将值映射到相应的节点,如果节点是结束节点,则有一个标志。
实际Trie
实施
public class Trie<T> {
private final Node<T> root;
public Trie() {
this.root = new Node<>(null, null);
}
public void insert(T[] elements) {
if (elements.length == 0) {
//don't want to set root as end node
return;
}
Node<T> currentNode = this.root;
for (T element : elements) {
currentNode = currentNode.addChild(element, currentNode);
}
currentNode.setEnd(true);
}
public Map<Collection<T>, Integer> countPrefixes(BiConsumer<Deque<T>, T> operation) {
Map<Collection<T>, Integer> map = new LinkedHashMap<>();
this.countPrefixes(this.root, map, operation);
return map;
}
private void countPrefixes(Node<T> current, Map<Collection<T>, Integer> map, BiConsumer<Deque<T>, T> operation) {
if (current != this.root) {
//check java doc for AbstractSet hashCode and equals
ArrayList<T> prefix = this.buildKey(current, operation);
int childrenCount = current.children().size();
if (childrenCount == 0 && current.isEnd()) {
//this sets entire collection(entire sentence 'Mary had a little lamb named Willy' for example)
//as prefix which is met once
childrenCount = 1;
}
map.merge(prefix, childrenCount, Integer::sum);
if (childrenCount > 1) {
//each parent node is already marked as prefix met once,
//but any node having more than one child means entire chain of nodes
//is a prefix met the number of children,
//so we go backwards to update parent counts with the difference
this.updateParentPrefixes(current.getParent(), childrenCount - 1, map, operation);
}
}
for (Node<T> child : current.children()) {
//visit each child recursively to count them
//depth first
this.countPrefixes(child, map, operation);
}
}
//operation is abstraction for the order
//in which we want to add elements
//when building key/prefix
private ArrayList<T> buildKey(Node<T> node, BiConsumer<Deque<T>, T> operation) {
Deque<T> prefix = new LinkedList<>();
while (node.getValue() != null) {
operation.accept(prefix, node.getValue());
node = node.getParent();
}
return new ArrayList<>(prefix);
}
private void updateParentPrefixes(Node<T> parent, int updateCount, Map<Collection<T>, Integer> map, BiConsumer<Deque<T>, T> operation) {
if (parent == this.root) {
//we don't want to update root, ever!!!
return;
}
ArrayList<T> prefix = this.buildKey(parent, operation);
map.merge(prefix, updateCount, Integer::sum);
//visit parents recursively until root
this.updateParentPrefixes(parent.getParent(), updateCount, map, operation);
}
}
在某种意义上简化了一些,只实现了insert
,但我们现在不需要update和delete。需要注意的几点:
- Trie 必须有一个 root 节点,它是空的。我们用
null
值和null
父级表示。我们必须永不更新此节点。 countPrefixes()
的参数BiConsumer<Deque<T>, T> operation
。这是我们在创建前缀时想要添加元素的顺序的抽象。它是必需的,因为在反转 collection/sentence. 时,计数后缀可以表示并实现为计数前缀
- Return 输入
Map<Collection<T>, Integer>
,共countPrefixes()
。此实现更通用,因此每个前缀都表示为 节点值集合 . - 为什么前缀是
ArrayList
?首先我们需要保持插入顺序。其次,它对hashCode()
和equals()
的实现使其成为映射键的良好候选者。引用 javadoc:
This ensures that list1.equals(list2) implies that list1.hashCode()==list2.hashCode() for any two lists, list1 and list2, as required by the general contract of Object.hashCode.
- 对于 hashCode。
In other words, two lists are defined to be equal if they contain the same elements in the same order.
- 等于。
- 代码中的注释应解释实施中的其余细节。
要测试的主线。
public class TrieMain {
public static void main(String[] args) {
String spacePattern = "\s+";
String emptyStringPattern = "";
ArrayList<String> strList = new ArrayList<>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Mary had a sandwich");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("Willy had a big dog named Willy");
strList.add("Willy had a huge dog named Willy");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Trie<String> trie = new Trie<>();
//using another trie for suffix to avoid counting suffix as prefix and vice versa
Trie<String> reversedTrie = new Trie<>();
for (String string : strList) {
String pattern = string.startsWith("(") ? emptyStringPattern : spacePattern;
String[] words = string.split(pattern);
trie.insert(words);
//reverse collection to count suffixes
Collections.reverse(Arrays.asList(words));
reversedTrie.insert(words);
}
Map<Collection<String>, Integer> prefixCount = trie.countPrefixes(Deque::addFirst);
System.out.println("Prefixes:");
printMap(prefixCount);
System.out.println();
Map<Collection<String>, Integer> suffixCount = reversedTrie.countPrefixes(Deque::addLast);
System.out.println("Suffixes:");
printMap(suffixCount);
}
private static void printMap(Map<Collection<String>, Integer> map) {
map.entrySet()
.stream()
.filter(entry -> entry.getValue() > 1)
.forEach(e -> {
String key = String.join(" ", e.getKey());
String line = String.format("%s -> %d", key, e.getValue());
System.out.println(line);
});
}
}
这里需要注意的事项:
- 在列表中添加了更多句子以进行额外测试。
- 我将
(
的输入拆分为字符而不是单词作为其余部分以实现您的输出,但示例有些混乱。对于句子前缀是单词的集合,而对于这两个前缀是字符。 - 后缀需要第二个
trie
- 否则我们可能会将前缀算作后缀,反之亦然。 Arrays.asList()
返回的列表由输入数组支持。我正在利用它来反转输入数组本身的后缀。
编辑: 虽然使用实现的字符进行测试 AbstractSet,或者准确地说是任何集合,对于前缀来说是个坏主意,因为它不允许重复的元素。更新了 trie 实现以使用 ArrayList
。我还添加了 2 个主要方法,以演示 trie 使用字符串和字符:
- Strings - 每个节点是一个词,通过space 拆分字符串得到
public class TrieWords {
public static void main(String[] args) {
ArrayList<String> strList = new ArrayList<>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Mary had a sandwich");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("Willy had a big dog named Willy");
strList.add("Willy had a huge dog named Willy");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Trie<String> trie = new Trie<>();
//using another trie for suffix to avoid counting suffix as prefix and vice versa
Trie<String> reversedTrie = new Trie<>();
for (String string : strList) {
String[] words = string.split("\s+");
trie.insert(words);
//reverse collection to count suffixes
Collections.reverse(Arrays.asList(words));
reversedTrie.insert(words);
}
Map<Collection<String>, Integer> prefixCount = trie.countPrefixes(Deque::addFirst);
System.out.println("Prefixes:");
printMap(prefixCount);
System.out.println();
Map<Collection<String>, Integer> suffixCount = reversedTrie.countPrefixes(Deque::addLast);
System.out.println("Suffixes:");
printMap(suffixCount);
}
private static void printMap(Map<Collection<String>, Integer> map) {
map.entrySet()
.stream()
.filter(entry -> entry.getValue() > 1)
.forEach(e -> {
String key = String.join(" ", e.getKey());
String line = String.format("%s -> %d", key, e.getValue());
System.out.println(line);
});
}
}
- 字符 - 每个节点都是字符串 中的一个字符
public class TrieChars {
public static void main(String[] args) {
ArrayList<String> strList = new ArrayList<>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Mary had a sandwich");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("Willy had a big dog named Willy");
strList.add("Willy had a huge dog named Willy");
strList.add("(abcd)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
Trie<Character> trie = new Trie<>();
//using another trie for suffix to avoid counting suffix as prefix and vice versa
Trie<Character> reversedTrie = new Trie<>();
for (String string : strList) {
int length = string.length();
Character[] words = new Character[length];
Character[] reversed = new Character[length];
for (int i = 0; i < length; i++) {
int reversedIndex = length - 1 - i;
words[i] = string.charAt(i);
reversed[reversedIndex] = string.charAt(i);
}
trie.insert(words);
reversedTrie.insert(reversed);
}
Map<Collection<Character>, Integer> prefixCount = trie.countPrefixes(Deque::addFirst);
System.out.println("Prefixes:");
printMap(prefixCount);
System.out.println();
Map<Collection<Character>, Integer> suffixCount = reversedTrie.countPrefixes(Deque::addLast);
System.out.println("Suffixes:");
printMap(suffixCount);
}
private static void printMap(Map<Collection<Character>, Integer> map) {
map.entrySet()
.stream()
.filter(entry -> entry.getValue() > 1)
.forEach(e -> {
String key = e.getKey().stream().map(Object::toString).collect(Collectors.joining(""));
String line = String.format("%s -> %d", key, e.getValue());
System.out.println(line);
});
}
}