查找二维矩阵是否是另一个二维矩阵的子集

Question

最近我参加了一个黑客马拉松，我开始了解一个问题，该问题试图在 2d matrix.A 模式中找到网格形式的模式可以是 U、H 和 T，并且将是用3*3矩阵表示假设我想展示 H 和 U

+--+--+--+            +--+--+--+
|1 |0 |1 |            |1 |0 |1 |
+--+--+--+            +--+--+--+
|1 |1 |1 |  --> H     |1 |0 |1 |    -> U
+--+--+--+            +--+--+--+
|1 |0 |1 |            |1 |1 |1 |
+--+--+--+            +--+--+--+

现在我需要在 10*10 matrix containing 0s and 1s 中搜索它。最接近且唯一的解决方案我可以获得 O(n^4) 的蛮力算法。在像 MATLAB 和 R 这样的语言中，有非常微妙的方法可以做到这但不是在 C，C++ 中。我尝试了很多在 Google 和 SO.But 上搜索这个解决方案，我能得到的最接近的是这个 SO POST，它讨论了实现 Rabin-Karp 字符串搜索算法 。但没有伪代码或任何 post 解释 this.Could 任何人帮助或提供任何 link、pdf 或一些逻辑来简化这个？

编辑

饰演 Eugene Sh。评论说如果 N 是大矩阵 (NxN) 和 k - 小矩阵 (kxk) 的大小，buteforce 算法应该采用 O((Nk)^2)。由于 k 是固定的，因此它减少到 O(N^2)。是的，绝对正确。但是如果N和K很大，有什么通用的方法吗？

Answer 1

让我们从 O(N * N * K) 解决方案开始。我将使用以下符号：A 是一个模式矩阵，B 是一个大矩阵（我们将在其中搜索模式的出现）。

我们可以固定B矩阵的顶行（也就是说，我们将搜索从某个位置（这一行，任何一列）开始的所有事件。我们称它为行 a topRow。现在我们可以对该矩阵进行切片，其中包含 [topRow; topRow + K) 行和所有列。
让我们创建一个新的矩阵作为串联的结果 A + column + the slice，其中 column 是具有 K 个元素的列，这些元素不存在于 A或B（如果A和B由0和1组成，我们可以使用-1。现在我们可以将这个新矩阵的列视为字母和运行 Knuth-Morris-Pratt 算法。比较两个字母需要O(K)时间，因此这一步的时间复杂度为O(N * K).

有 O(N) 种方法可以固定第一行，因此总时间复杂度为 O(N * N * K)。它已经比蛮力解决方案更好，但我们还没有完成。理论下界是O(N * N)（我假设是N >= K），我想达到

让我们看看这里有哪些可以改进的地方。如果我们可以在 O(1) 时间而不是 O(k) 内比较矩阵的两列，我们就可以达到所需的时间复杂度。让我们连接 A 和 B 的所有列，在每列之后插入一些分隔符。现在我们有一个字符串，我们需要比较它的子字符串（因为列和它们的部分现在是子字符串）。让我们在线性时间内构造一个后缀树（使用 Ukkonnen 算法）。现在比较两个子串就是找到这棵树中两个节点的最低共同祖先（LCA）的高度。有一个 algorithm 允许我们用线性预处理时间和每个 LCA 查询的 O(1) 时间来完成它。这意味着我们可以在常数时间内比较两个子串（或列）！因此，总时间复杂度为O(N * N)。还有另一种方法可以实现这种时间复杂度：我们可以在线性时间内构建一个后缀数组，并在常数时间内回答最长的公共前缀查询（使用线性时间预处理）。然而，这两个 O(N * N) 解决方案看起来都很难实现，而且它们会有很大的常数。

P.S 如果我们有一个我们可以完全信任的多项式哈希函数（或者我们可以接受一些误报），我们可以使用 2-D 得到一个更简单的 O(N * N) 解决方案多项式哈希。

Answer 2

好的，下面是 2D Rabin-Karp 方法。

对于下面的讨论，假设我们要在一个 (n, n)矩阵。（这个概念同样适用于矩形矩阵，但我运行没有索引。）

我们的想法是，对于每个可能的 sub-matrix，我们计算一个哈希值。仅当该散列与我们要查找的矩阵的散列匹配时，我们才会比较 element-wise.

为了提高效率，我们必须每次都避免 re-computing sub-matrix 的整个散列。因为我今晚睡得很少，所以我唯一可以弄清楚如何轻松做到这一点的哈希函数是各自 sub-matrix 中 1 的总和。我把它作为练习留给比我聪明的人找出更好的滚动哈希函数。

现在，如果我们刚刚检查了 sub-matrix 从 (i, j) 到 (i + m – 1, j + m – 1) 并知道它里面有x个1，我们可以计算sub-matrix右边1个中1的个数——也就是从(i, j + 1) 至 (i + m – 1, j + m) – 通过从 (i, 中减去 sub-vector 中 1 的个数]j) 到 (i + m – 1, j) 并添加sub-vector 从 (i, j + m) 到(i + m – 1, j + m ).

如果我们击中大矩阵的右边距，我们将 window 向下移动一位，然后返回到左边距，然后再次向下移动一位，然后再次向右移动，依此类推。

请注意，这需要 O(m) 次操作，而不是 O(m² ) 每个候选人。如果我们对每一对索引都这样做，我们得到 O(mn²) 的工作。因此，通过巧妙地将潜在 sub-matrix 大小的 window 移动到大矩阵，我们可以将工作量减少 m 倍。也就是说，如果我们没有得到太多哈希冲突。

这是一张图片：

当我们将当前的window右移一位时，减去左边红色列向量中1的个数，加上右边绿色列向量中1的个数获取新的window.

中1的个数

我已经使用很棒的 Eigen C++ 模板库实现了这个想法的快速演示。该示例还使用了 Boost 中的一些内容，但仅用于参数解析和输出格式化，因此如果您没有 Boost 但想尝试代码，则可以轻松摆脱它。索引摆弄有点混乱，但我将在此处不作进一步解释。上面的散文应该足以涵盖它。

#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <random>
#include <type_traits>
#include <utility>

#include <boost/format.hpp>
#include <boost/lexical_cast.hpp>

#include <Eigen/Dense>

#define PROGRAM "submatrix"
#define SEED_CSTDLIB_RAND 1

using BitMatrix = Eigen::Matrix<bool, Eigen::Dynamic, Eigen::Dynamic>;
using Index1D = BitMatrix::Index;
using Index2D = std::pair<Index1D, Index1D>;

std::ostream&
operator<<(std::ostream& out, const Index2D& idx)
{
  out << "(" << idx.first << ", " << idx.second << ")";
  return out;
}

BitMatrix
get_random_bit_matrix(const Index1D rows, const Index1D cols)
{
  auto matrix = BitMatrix {rows, cols};
  matrix.setRandom();
  return matrix;
}

Index2D
findSubMatrix(const BitMatrix& haystack,
              const BitMatrix& needle,
              Index1D *const collisions_ptr = nullptr) noexcept
{
  static_assert(std::is_signed<Index1D>::value, "unsigned index type");
  const auto end = Index2D {haystack.rows(), haystack.cols()};
  const auto hr = haystack.rows();
  const auto hc = haystack.cols();
  const auto nr = needle.rows();
  const auto nc = needle.cols();
  if (nr > hr || nr > hc)
    return end;
  const auto target = needle.count();
  auto current = haystack.block(0, 0, nr - 1, nc).count();
  auto j = Index1D {0};
  for (auto i = Index1D {0}; i <= hr - nr; ++i)
    {
      if (j == 0)  // at left margin
        current += haystack.block(i + nr - 1, 0, 1, nc).count();
      else if (j == hc - nc)  // at right margin
        current += haystack.block(i + nr - 1, hc - nc, 1, nc).count();
      else
        assert(!"this should never happen");
      while (true)
        {
          if (i % 2 == 0)  // moving right
            {
              if (j > 0)
                current += haystack.block(i, j + nc - 1, nr, 1).count();
            }
          else  // moving left
            {
              if (j < hc - nc)
                current += haystack.block(i, j, nr, 1).count();
            }
          assert(haystack.block(i, j, nr, nc).count() == current);
          if (current == target)
            {
              // TODO: There must be a better way than using cwiseEqual().
              if (haystack.block(i, j, nr, nc).cwiseEqual(needle).all())
                return Index2D {i, j};
              else if (collisions_ptr)
                *collisions_ptr += 1;
            }
          if (i % 2 == 0)  // moving right
            {
              if (j < hc - nc)
                {
                  current -= haystack.block(i, j, nr, 1).count();
                  ++j;
                }
              else break;
            }
          else  // moving left
            {
              if (j > 0)
                {
                  current -= haystack.block(i, j + nc - 1, nr, 1).count();
                  --j;
                }
              else break;
            }
        }
      if (i % 2 == 0)  // at right margin
        current -= haystack.block(i, hc - nc, 1, nc).count();
      else  // at left margin
        current -= haystack.block(i, 0, 1, nc).count();
    }
  return end;
}

int
main(int argc, char * * argv)
{
  if (SEED_CSTDLIB_RAND)
    {
      std::random_device rnddev {};
      srand(rnddev());
    }
  if (argc != 5)
    {
      std::cerr << "usage: " << PROGRAM
                << " ROWS_HAYSTACK COLUMNS_HAYSTACK"
                << " ROWS_NEEDLE COLUMNS_NEEDLE"
                << std::endl;
      return EXIT_FAILURE;
    }
  auto hr = boost::lexical_cast<Index1D>(argv[1]);
  auto hc = boost::lexical_cast<Index1D>(argv[2]);
  auto nr = boost::lexical_cast<Index1D>(argv[3]);
  auto nc = boost::lexical_cast<Index1D>(argv[4]);
  const auto haystack = get_random_bit_matrix(hr, hc);
  const auto needle = get_random_bit_matrix(nr, nc);
  auto collisions = Index1D {};
  const auto idx = findSubMatrix(haystack, needle, &collisions);
  const auto end = Index2D {haystack.rows(), haystack.cols()};
  std::cout << "This is the haystack:\n\n" << haystack << "\n\n";
  std::cout << "This is the needle:\n\n" << needle << "\n\n";
  if (idx != end)
    std::cout << "Found as sub-matrix at " << idx << ".\n";
  else
    std::cout << "Not found as sub-matrix.\n";
  std::cout << boost::format("There were %d (%.2f %%) hash collisions.\n")
    % collisions
    % (100.0 * collisions / ((hr - nr) * (hc - nc)));
  return (idx != end) ? EXIT_SUCCESS : EXIT_FAILURE;
}

编译运行时，请把上面的当成pseudo-code。我几乎没有尝试优化它。这只是我自己的一个 proof-of 概念。

Answer 3

我将介绍一种算法，在最坏的情况下，通常只要 k = O(sqrt(n)) 和 O(n*n + n*k*k) 就需要 O(n*n) 的时间。这是 Aho-Corasick 到 2D 的扩展。回想一下，Aho-Corasick 定位了目标字符串 T 中一组模式的所有出现，并且它在时间上与模式长度、T 的长度和出现次数成线性关系。

让我们介绍一些术语。干草堆是我们正在搜索的大矩阵，针是模式矩阵。大海捞针是nxn矩阵，针是kxk矩阵。我们将在 Aho-Corasick 中使用的模式集是针的行集。此集合最多包含 k 行，如果有重复行，则包含的行会更少。

我们将构建 Aho-Corasick 自动机（这是一个增加了失败链接的 Trie），然后运行在大海捞针的每一行上构建搜索算法。因此，我们取出每一行针并在大海捞针的每一行中搜索它。我们可以使用线性时间一维匹配算法来执行此操作，但这仍然效率低下。 Aho-Corasick 的优点是它一次搜索所有模式。

在搜索过程中，我们将填充一个矩阵 A，稍后我们将使用它。当我们在 haystack 的第一行中搜索时，A 的第一行填充了 haystack 第一行中针的行的出现。因此，我们将以 A 的第一行结束，例如 2 - 0 - - 1。这意味着针的行号0出现在大海捞针第一行的位置2；行号 1 出现在位置 5；行号 2 出现在位置 0。- 条目是未匹配的位置。对每一行继续这样做。

现在让我们假设针中没有重复的行。将 0 分配给针的第一行，将 1 分配给第二行，依此类推。现在我们将使用线性时间一维搜索算法（例如 KMP）在矩阵 A 的每一列中搜索模式 [0 1 2 ... k-1]。回想一下，A 的每一行都存储针行出现的位置。因此，如果一列包含模式 [0 1 2 ... k-1]，这意味着针的行号 0 出现在大海捞针的某一行，针的行号 1 就在它的下方，并且很快。这正是我们想要的。如果有重复的行，只需为每个唯一的行分配一个唯一的编号。

使用线性时间算法在列中搜索需要 O(n)。因此搜索所有列需要 O(n*n)。我们在搜索期间填充矩阵，我们搜索大海捞针的每一行（有 n 行）并且一行中的搜索需要 O(n+k*k)。所以 O(n(n+k*k)) 总的来说。

所以我的想法是找到那个矩阵，然后将问题简化为一维模式匹配。 Aho-Corasick就是为了效率而存在的，不知道有没有其他高效的求矩阵的方法

编辑：添加了实现。

这是我的 C++ 实现。 n 的最大值设置为 100，但您可以更改它。

程序首先读取两个整数 n k（矩阵的维度）。然后它读取 n 行，每行包含一个长度为 n 的 0 和 1 字符串。然后它读取 k 行，每行包含一个长度为 k 的 0 和 1 字符串。输出是所有匹配项的左上角坐标。例如下面的输入。

12 2
101110111011
111010111011
110110111011
101110111010
101110111010
101110111010
101110111010
111010111011
111010111011
111010111011
111010111011
111010111011
11
10

程序将输出：

match at (2,0)
match at (1,1)
match at (0,2)
match at (6,2)
match at (2,10)

#include <cstdio>
#include <cstring>
#include <string>
#include <queue>
#include <iostream>

using namespace std;

const int N = 100;
const int M = N;
int n, m;

string haystack[N], needle[M];
int A[N][N]; /* filled by successive calls to match */
int p[N]; /* pattern to search for in columns of A */

struct Node
{   Node *a[2]; /* alphabet is binary */
    Node *suff; /* pointer to node whose prefix = longest proper suffix of this node */
    int flag;

    Node()
    {   a[0] = a[1] = 0;
        suff = 0;
        flag = -1;
    }
};

void insert(Node *x, string s)
{   static int id = 0;
    static int p_size = 0;

    for(int i = 0; i < s.size(); i++)
    {   char c = s[i];
        if(x->a[c - '0'] == 0)
            x->a[c - '0'] = new Node;
        x = x->a[c - '0'];
    }
    if(x->flag == -1)
        x->flag = id++;

    /* update pattern */
    p[p_size++] = x->flag;
}

Node *longest_suffix(Node *x, int c)
{   while(x->a[c] == 0)
        x = x->suff;
    return x->a[c];
}

Node *mk_automaton(void)
{   Node *trie = new Node;
    for(int i = 0; i < m; i++)
    {   insert(trie, needle[i]);
    }

    queue<Node*> q;

    /* level 1 */
    for(int i = 0; i < 2; i++)
    {   if(trie->a[i])
        {   trie->a[i]->suff = trie;
            q.push(trie->a[i]);
        }
        else trie->a[i] = trie;
    }

    /* level > 1 */
    while(q.empty() == false)
    {   Node *x = q.front(); q.pop();
        for(int i = 0; i < 2; i++)
        {   if(x->a[i] == 0) continue;
            x->a[i]->suff = longest_suffix(x->suff, i);
            q.push(x->a[i]);
        }
    }

    return trie;
}

/* search for patterns in haystack[j] */
void match(Node *x, int j)
{   for(int i = 0; i < n; i++)
    {   x = longest_suffix(x, haystack[j][i] - '0');
        if(x->flag != -1)
        {   A[j][i-m+1] = x->flag;
        }
    }
}

int match2d(Node *x)
{   int matches = 0;
    static int z[M+N];
    static int z_str[M+N+1];

    /* init */
    memset(A, -1, sizeof(A));

    /* fill the A matrix */
    for(int i = 0; i < n; i++)
    {   match(x, i);
    }

    /* build string for z algorithm */
    z_str[n+m] = -2; /* acts like `[=12=]` for strings */
    for(int i = 0; i < m; i++)
    {   z_str[i] = p[i];
    }

    for(int i = 0; i < n; i++)
    {   /* search for pattern in column i */
        for(int j = 0; j < n; j++)
        {   z_str[j + m] = A[j][i];
        }

        /* run z algorithm */
        int l, r;
        l = r = 0;
        z[0] = n + m;
        for(int j = 1; j < n + m; j++)
        {   if(j > r)
            {   l = r = j;
                while(z_str[r] == z_str[r - l]) r++;
                z[j] = r - l;
                r--;
            }
            else
            {   if(z[j - l] < r - j + 1)
                {   z[j] = z[j - l];
                }
                else
                {   l = j;
                    while(z_str[r] == z_str[r - l]) r++;
                    z[j] = r - l;
                    r--;
                }
            }
        }

        /* locate matches */
        for(int j = m; j < n + m; j++)
        {   if(z[j] >= m)
            {   printf("match at (%d,%d)\n", j - m, i);
                matches++;
            }
        }
    }

    return matches;
}

int main(void)
{   cin >> n >> m;
    for(int i = 0; i < n; i++)
    {   cin >> haystack[i];
    }
    for(int i = 0; i < m; i++)
    {   cin >> needle[i];
    }

    Node *trie = mk_automaton();
    match2d(trie);

    return 0;
}

查找二维矩阵是否是另一个二维矩阵的子集

Find whether a 2d matrix is subset of another 2d matrix

c

c++

algorithm

matrix