查找有多少子字符串的第一个和最后一个字符在内部重复的快速方法

Question

这是我创建的子字符串的问题。我想知道如何针对此问题实施 O(nlog(n)) 解决方案，因为天真的方法非常简单。这是怎么回事。你有一个字符串 S。 S 有很多子串。在某些子字符串中，第一个字符和最后一个字符不止出现一次。找出第一个和最后一个字符出现不止一次的子串有多少。

Input: "ABCDCBE"
Expected output: 2
Explanation: "BCDCB" and "CDC" are two such substrings

该测试用例解释只有“BCDCB”和“CDC”，其中第一个字符和最后一个字符相同。

除了以“ABABCAC”作为第一个字符“A”出现3次，最后一个字符“C”出现两次的子字符串的示例案例之外，还可以有另一种情况。 "AAAABB" 也是另一个子串。

“AAAAB”不满足。

我了解到 O(nlog(n)) 可能有助于或可能不会有助于解决方案的是二叉索引树。二叉索引树可以以某种方式用来解决这个问题。还有排序和二进制搜索，但首先我想特别关注二进制索引树。

我正在寻找 space 的 O(n log(n)) 或更好的复杂度。

字符也是 UTF-16

Answer 1

我的解决方案的要点如下：

遍历输入数组，并为每个位置计算在该位置结束的 'valid' 个子字符串的数量。这些值的总和就是有效子串的总数。我们通过使用二进制索引树计算当前位置之前的子字符串的有效开始数量来实现这一点。

现在了解完整详情：

当我们遍历数组时，我们将当前元素视为子字符串的末尾，并且我们说有效开始的位置是其值再次出现在 it 之间的位置，以及我们当前迭代的位置。（即如果子字符串开头的值在其中至少出现两次）

例如：

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7

第一个 1（位于索引 0）是一个有效的开始，因为在它之后还有另一个 1（位于索引 4）当前索引（索引6）。

现在，计算当前索引之前的有效开始数量，我们得到的结果与我们想要的非常接近，只是我们可能会抓取一些子字符串的最后一个值没有出现两次的子字符串（即我们目前正在迭代的那个）

例如：

current index              V
data  = [1, 2, 3, 4, 1, 4, 3, 2]
valid = [1, 0, 1, 1, 0, 0, 0, 0]
         0  1  2  3  4  5  6  7
                  ^--------^

在这里，4被标记为有效的开始（因为它后面还有一个4），但是对应的子串没有两个3。

为了解决这个问题，我们将只考虑当前值之前出现的有效启动。（这意味着子字符串将包含当前值及其先前出现的值，因此，最后一个元素将至少出现在子字符串中两次）

伪代码如下：

fn solve(arr) {
  answer := 0
  for i from 1 to length(arr) {
    previous_index := find_previous(arr, i)

    if there is a previous_index {
      arr[previous_index].is_valid_start = true
      answer += count_valid_starts_up_to_and_including(arr, previous_index)
    }
  }
  return answer
}

为了有效地实施这些操作，我们使用散列 table 来查找值的先前位置，并使用二叉索引树 (BIT) 来跟踪和计算有效位置。

因此，更充实的伪代码看起来像

fn solve(arr) {
  n := length(arr)

  prev := hash_table{}
  bit  := bit_indexed_tree{length = n}

  answer := 0
  for i from 1 to length(arr) {
    value := arr[i]
    previous_index := prev[value]

    if there is a previous_index {
      bit.update(previous_index, 1)
      answer += bit.query(previous_index)
    }

    prev[value] = i
  }
  return answer
}

最后，由于伪代码并不总是足够的，这里是 C++ 中的一个实现，其中控制流有点乱，以确保有效使用 std::unordered_map（C++ 的内置哈希 table)

class Bit { 
    std::vector<int> m_data;
public:
    // initialize BIT of size `n` with all 0s
    Bit(int n);

    // add `value` to index `i`
    void update(int i, int value);

    // sum from index 0 to index `i` (inclusive)
    int query(int i);
};

long long solve (std::vector<int> const& arr) {
    int const n = arr.size();

    std::unordered_map<int, int> prev_index;
    Bit bit(n);

    long long answer = 0;
    int i = 0;
    for (int value : arr) {

        auto insert_result = prev_index.insert({value, i});
        if (!insert_result.second) { // there is a previous index
            int j = insert_result.first->second;

            bit.update(j, 1);
            answer += bit.query(j);

            insert_result.first->second = i;
        }

        ++i;
    }

    return answer;
}

编辑：为了透明起见，这是我用来测试此代码的 Fenwick 树实现

struct Bit {
    std::vector<int> m_data;
    Bit(int n) : m_data(n+2, 0) { }
    int query(int i) {
        int res = 0;
        for(++i; i > 0; i -= i&-i) res += m_data[i];
        return res;
    }
    void update(int i, int x) {
        for(++i; i < m_data.size(); i += i&-i) m_data[i] += x;
    }
};

查找有多少子字符串的第一个和最后一个字符在内部重复的快速方法

Quick Way of Finding How many Substrings has first and last character repeated inside

algorithm

substring

time-complexity

binary-indexed-tree