Language-independent: check whether a string consists of a multiple of a certain substring

Question

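I want a general algorithm to find out whether a string consists of a repeating pattern, with no part of the string left outside that pattern.

For example, look at these sample strings: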

abcabcabc - true
abcabcabcx - false
cucumbercucumber - true
cucumber - false
abaaabaaabaa - true

I looked at this answer, which solves the problem in some cases but fails on the cucumber example. I need something that works in all cases.

Answer 1

This seems like the obvious way to do it:

// string to test is passed in, e.g. isRepeated("abaaabaabaa")
static boolean isRepeated(String s)
{ candidates:
  for (int repeating_pattern_length = 1;
       repeating_pattern_length <= s.length() / 2;
       repeating_pattern_length++)
  {  if (s.length() % repeating_pattern_length == 0)
     { // can fit exactly N times
       String proposed_subpattern = s.substring(0, repeating_pattern_length);
       for (int nth_instance = 2; // don't need to check 1st occurrence
            nth_instance <= s.length() / repeating_pattern_length;
            nth_instance++)
       { // check nth occurrence
         int start = (nth_instance - 1) * repeating_pattern_length;
         if (!proposed_subpattern.equals(
                 s.substring(start, start + repeating_pattern_length)))
           continue candidates; // nth occurrence doesn't match; try next length
       }
       return true; // every occurrence matched
     }
  }
  return false;
}

[Untested. This is meant to be Java, but I'm not an expert Java coder. Forgive my trespasses.]

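This arguably has complexity close to O(s.length) with a small constant factor: only pattern lengths that divide s.length are tested, and each test costs at most O(s.length), so the worst case grows with the number of divisors of s.length.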
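
For anyone who prefers something directly runnable, here is a minimal Python sketch of the same divisor-based idea. The function name is_repeated is my own choice and the code is only an illustration of the approach described above, not a tuned implementation.

```python
def is_repeated(s: str) -> bool:
    """Return True if s is two or more copies of some shorter substring."""
    n = len(s)
    for size in range(1, n // 2 + 1):
        if n % size != 0:
            continue  # a pattern of this length cannot tile s exactly
        pattern = s[:size]
        # Compare every later block of `size` characters with the first block.
        if all(s[i:i + size] == pattern for i in range(size, n, size)):
            return True
    return False
```

The empty string and single-character strings return False here, because the loop over candidate lengths never runs.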

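One might consider building a suffix tree (also linear time) and then checking whether the tree has the appropriate cycles. I suspect the algorithm above is pretty good in practice.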

Answer 2

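Since you didn't ask for a specific language, I'd recommend looking at the Rosetta Code page for Repeating String. There you can find and study a bunch of algorithms that solve the problem. Although the problem is stated for 1s and 0s on Rosetta Code, most of the solutions should work with any possible string.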

I wrote a general recursive Common Lisp solution; here is the commented code:

(ql:quickload :alexandria)
(defun rep-stringv (a-str &optional (max-rotation (floor (/ (length a-str) 2))))
  ;; Exit condition if no repetition found.
  (cond ((< max-rotation 1) "Not a repeating string")
        ;; Two checks:
        ;; 1. Truncated string must be equal to rotation by repetition size.
        ;; 2. Remaining chars (rest-str) are identical to starting chars (beg-str)
        ((let* ((trunc (* max-rotation (truncate (length a-str) max-rotation)))
                (truncated-str (subseq a-str 0 trunc))
                (rest-str (subseq a-str trunc))
                (beg-str (subseq a-str 0 (rem (length a-str) max-rotation))))
           (and (string= beg-str rest-str)
                (string= (alexandria:rotate (copy-seq truncated-str) max-rotation)
                         truncated-str)))
         ;; If both checks pass, return the repeating string.
         (subseq a-str 0 max-rotation))
        ;; Recurse function reducing length of rotation.
        (t (rep-stringv a-str (1- max-rotation)))))

Tests:

CL-USER> (rep-stringv "cucumber")
"Not a repeating string"
CL-USER> (rep-stringv "abaaabaaabaa")
"abaa"
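
The rotation check is the heart of this approach, so here is a rough Python sketch of it. As far as I can tell, the Lisp version follows the Rosetta Code task and also tolerates a truncated final repetition; this sketch instead assumes the stricter requirement from the question (nothing left over), so it adds a divisibility test. The name repeating_unit is hypothetical.

```python
def repeating_unit(s: str):
    """Return the longest proper prefix that tiles s exactly, or None."""
    n = len(s)
    # Try the longest candidate first, as the recursive Lisp function does.
    for size in range(n // 2, 0, -1):
        # If size divides n and rotating s by size leaves it unchanged,
        # then s is s[:size] repeated n // size times.
        if n % size == 0 and s[size:] + s[:size] == s:
            return s[:size]
    return None
```

Because it searches from the longest candidate downwards, a string such as "aaaa" yields "aa" rather than "a".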

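The best solution can be achieved with a suffix tree for the string, as you probably already know, since it's a common problem described everywhere, e.g., on Wikipedia.

Implementing it seems like overkill to me unless you really need the performance. In any case, examples of suffix trees (in many languages) can be found here.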

Answer 3

Here is some basic C++ code that does the job:

#include <string>

bool IsRepeating( const std::string& in ) {

    int totalLength = static_cast<int>(in.length());
    for (int subLength = 1; subLength <= totalLength / 2; subLength++ ) {
        // Only lengths that divide the total length can tile the string.
        if (totalLength % subLength != 0) continue;

        bool mismatchFound = false;
        for (int startPos = 0; startPos < subLength && !mismatchFound; startPos++) {
            char startChar = in[startPos];
            // Compare this character with its counterpart in every later block.
            for (int delta = subLength; delta < totalLength-startPos; delta += subLength) {
                if (in[startPos+delta] != startChar ) {
                    mismatchFound = true;
                    break;
                }
            }
        }
        // Return only after every position of the prefix has been verified.
        if (!mismatchFound) return true;
    }
    return false;
}

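It makes use of the fact that the substring length has to be a divisor of the total string length.

The worst-case time complexity is pretty bad, something like O(n^2 log(log(n))), but I'm not sure. (The worst case is when the string consists of exactly two identical substrings.) Still, I believe that on average it should perform quite well, because most of the outer loop body is executed only for divisors of the string length, and the inner loops are aborted as soon as a mismatch is found.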
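
To get a feel for how much the divisor check prunes, here is a tiny Python sketch that lists the only substring lengths the outer loop really has to examine. The helper name candidate_lengths is mine and is not part of the C++ code above.

```python
def candidate_lengths(n: int) -> list:
    """Lengths of proper prefixes that could tile a string of length n."""
    # A repeating unit must fit at least twice, so it is at most n // 2 long,
    # and it must divide n exactly.
    return [k for k in range(1, n // 2 + 1) if n % k == 0]
```

For a 12-character string this leaves five candidates (1, 2, 3, 4 and 6); for a string of prime length only length 1 remains.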

Edit: @Veedrac's solution is not only more elegant but also more performant in most cases. For a direct comparison, here is the C++ version:

bool IsRepeating( const std::string& in ) {
    if (in.length() < 1) return false;
    return (in + in).substr(1, 2 * in.length() - 2).find(in) != std::string::npos;
}

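It does use more memory, though. And if you don't know the purpose of the function, it might be hard to figure out. But that also applies to my original version.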
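
In case the doubling trick looks like magic: s occurs in s + s at some offset i with 0 < i < len(s) exactly when rotating s by i gives s back, and that only happens for repeated strings. The smallest such offset is then the length of the repeating unit. Here is a short Python sketch of that reasoning; the name smallest_unit is hypothetical.

```python
def smallest_unit(s: str):
    """Return the shortest substring whose repetition gives s, or None."""
    if not s:
        return None
    # Search from index 1 so the trivial match at offset 0 is skipped.
    # A match always exists at offset len(s), where the second copy starts.
    i = (s + s).find(s, 1)
    # An earlier match means s equals its own rotation by i.
    return s[:i] if i < len(s) else None
```

Unlike the boolean check, this also tells you what the pattern is, at the same memory cost of one doubled copy of the string.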

Answer 4

An inspired Python solution is:

s in (s + s)[1:-1]

Assuming an efficient implementation of str.__contains__, this takes O(n) time.
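
Wrapped in a function and run against the examples from the question, it might look like the sketch below. The function name is_repeated_doubling is my own, and the explicit empty-string guard is an assumption on my part: without it the expression evaluates to True for "", whereas the C++ version above returns false for empty input.

```python
def is_repeated_doubling(s: str) -> bool:
    """Return True if s is two or more copies of some shorter substring."""
    # Dropping the first and last character of s + s removes the two
    # trivial matches, so any remaining match is a genuine repetition.
    return bool(s) and s in (s + s)[1:-1]


# The sample strings from the question with their expected results.
examples = {
    "abcabcabc": True,
    "abcabcabcx": False,
    "cucumbercucumber": True,
    "cucumber": False,
    "abaaabaaabaa": True,
}

for text, expected in examples.items():
    assert is_repeated_doubling(text) == expected
```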

regex

language-agnostic

pattern-matching