Identify recurring/duplicate patterns as sub-arrays from a parent array

Question

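I have a typical pattern-search problem: I need to identify where multiple patterns occur in an array and pick them out.

For example: ['horse', 'camel', 'horse', 'camel', 'tiger', 'horse', 'camel', 'horse', 'camel']

The function should return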

['horse', 'camel'], 
['horse', 'camel', 'horse'],
['camel', 'horse', 'camel'],
['horse', 'camel', 'horse', 'camel']

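That is, find the recurring patterns in the array that can form a sub-array,

or, defined another way: find all sub-arrays that occur more than once in the main array.

The resulting arrays should have length > 1:

[1, 2, 3, 1, 2, 1, 4, 5] => [1,2,3] and [1,4,5] are both sub-arrays, but [1,2,3] is the recurring/repeating sub-array, NOT [1,4,5]

I am looking for a suitably efficient algorithm rather than a brute-force looping solution.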

Answer 1

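This may not be what you want, but I don't know what you have already tried, so it may be useful. Here is my straightforward approach, which probably falls under your "brute-force looping solutions", but I wanted to give it a try since nobody has posted a complete answer.

In Java: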

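// imports needed by the methods below (which are assumed to live in one class)
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// use this to not add duplicates to list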
static boolean contains (List<String[]> patterns, String[] pattern){
    for(String[] s: patterns)
        if (Arrays.equals(pattern,s)) return true;
    return false;
}


/**
 *
 * @param str String array containing all elements in your set
 * @param start first index of the subarray (inclusive)
 * @param end last index of the subarray (inclusive)
 * @return true if the subarray occurs again, without overlap, after index end
 */
static boolean search (String[] str,int start,int end) {
    // length of pattern
    int len = end - start + 1;

    // how many additional occurrences of the pattern
    // are required after index end
    int n = 1;

    // increment m if pattern is matched
    int m = 0;

    // shift pattern down the array
    for (int i = end+1; i <= str.length - len; i++) {
        int j;
        for (j = 0; j < len; j++) {
            if (!str[i + j].equals(str[start + j]))
                break;
        }

        // if pattern is matched at [i to i+len]
        if (j == len) {
            m++;
            if (m == n) return true;
        }
    }
    return false;
}


/**
 *
 * @param str String array containing all elements in your set
 * @return a list of subarrays of the input array which are a recurring pattern
 */
static List<String[]> g (String[] str) {
    // put patterns in here
    List<String[]> patterns = new ArrayList<>();

    // iterate through all possible subarrays in str
    for(int i = 0; i < str.length-1; i++){
        for(int j = i + 1; j < str.length; j++){

            // if a pattern is found
            if (search(str,i,j)) {
                int len = j-i+1;
                String[] subarray = new String[len];
                System.arraycopy(str,i,subarray,0,len);
                if (!contains(patterns,subarray))
                    patterns.add(subarray);

            }
        }
    }
    return patterns;
}

public static void main(String[] args) {

    String[] str = {"horse", "camel", "horse", "camel", "tiger",
                    "horse", "camel", "horse", "camel"};
    // print out
    List<String[]> patterns = g(str);
    for (String[] s: patterns)
        System.out.println(Arrays.toString(s));
}

Output:

[horse, camel]
[horse, camel, horse]
[horse, camel, horse, camel]
[camel, horse]
[camel, horse, camel]

As mentioned in the comment I posted:

"would [camel, horse] be included in the output?"

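My output is consistent with this, since there are 2 instances of [camel, horse], at indices [1-2] and [6-7]. But maybe I have completely misunderstood your question and don't understand the constraints.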
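
To double-check the claim about [camel, horse], here is a small sketch that lists every start index at which a pattern occurs. The class OccurrenceCheck and the helper occurrences(...) are hypothetical names introduced for illustration; they are not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;

public class OccurrenceCheck {
    // Hypothetical helper (not part of the answer's code): returns every
    // start index at which `pattern` occurs in `arr`, overlaps included.
    static List<Integer> occurrences(String[] arr, String[] pattern) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + pattern.length <= arr.length; i++) {
            int j = 0;
            // advance j while the elements at this alignment agree
            while (j < pattern.length && arr[i + j].equals(pattern[j])) {
                j++;
            }
            if (j == pattern.length) {
                starts.add(i); // full match starting at index i
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] str = {"horse", "camel", "horse", "camel", "tiger",
                        "horse", "camel", "horse", "camel"};
        // prints [1, 6]
        System.out.println(occurrences(str, new String[]{"camel", "horse"}));
    }
}
```

The two start indices, 1 and 6, correspond to the index ranges [1-2] and [6-7] mentioned above.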

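As for optimization, the search(...) method, for example, is just a naive substring search, and there are more optimized ways to do this, such as Knuth–Morris–Pratt. Sorry if this is exactly what you didn't want, but maybe it is of some use.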
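
As a rough illustration of the Knuth–Morris–Pratt idea, here is a minimal sketch of a replacement for search(...) that keeps the same contract (does str[start..end] occur again entirely after index end?). The names KmpSearch, failureTable and searchKmp are my own for this sketch, and it assumes start <= end. It only speeds up each individual check; the outer loops still enumerate every subarray.

```java
import java.util.Arrays;

public class KmpSearch {
    // fail[i] = length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i]
    static int[] failureTable(String[] pattern) {
        int[] fail = new int[pattern.length];
        int k = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (k > 0 && !pattern[i].equals(pattern[k])) {
                k = fail[k - 1];
            }
            if (pattern[i].equals(pattern[k])) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Assumed contract, mirroring search(str, start, end): returns true if
    // str[start..end] occurs again starting at some index > end.
    static boolean searchKmp(String[] str, int start, int end) {
        String[] pattern = Arrays.copyOfRange(str, start, end + 1);
        int[] fail = failureTable(pattern);
        int k = 0; // number of pattern elements matched so far
        for (int i = end + 1; i < str.length; i++) {
            // on a mismatch, fall back to the longest reusable prefix
            while (k > 0 && !str[i].equals(pattern[k])) {
                k = fail[k - 1];
            }
            if (str[i].equals(pattern[k])) {
                k++;
            }
            if (k == pattern.length) {
                return true; // a second, non-overlapping occurrence exists
            }
        }
        return false;
    }
}
```

Swapping searchKmp in for search inside g(...) should give the same list of patterns, with each check running in time linear in the array length rather than proportional to the array length times the pattern length.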

Tags: language-agnostic, arrays, algorithm, pattern-recognition, sub-array