拆分重叠段

Splitting apart overlapping segments

这是一个分段向量

class Segment
{
public:
   size_t left;
   size_t right;
   char ID;
   Segment(size_t a, size_t b, char c):left(a), right(b), ID(c){assert(left<right);}
};


std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}}

矢量 A 根据属性 left 排序。可以将此向量的内容绘制为

       -------  'A'
           ---------------  'B'
                  ---  'C'
                    ---   'D' 
                             ------  'E'
                                ----   'F'
                                      ---  'G'
                                      ---  'H'
                                                  ---  'I'
                                                        -------  'J'

我想创建对象 B,其中包含来自 A 中那些大段的小的非重叠段。要达到 B,我们必须收缩彼此重叠的段,并为所有重叠的地方创建一个 ID 为 X 的新段。向量 B 也需要根据 left 属性进行排序。

对于上面的例子,预期的输出是

std::vector<Segment> B = {{3, 7, 'A'}, {7, 10, `X`}, {10, 14, 'B'}, {14, 19, 'X'}, {19, 22, 'B'} , {25, 28, 'E'}, {28, 31, 'X'}, {31, 32, 'F'}, {34, 37, 'X'}, {46, 49, 'I'}, {52, 59, 'J'}}


       ----  'A'
           ---  'X'  (overlap between 'A' and 'B')
              ----  'B'
                  --------  'X' (overlap between 'B', 'C' and 'D')
                          ---  'B'  -> Note that 'B' is now split in two
                                ---  'E'
                                   ---   'X'  (overlap between 'E' and 'F')
                                      -  'F'
                                         ---  'X' (overlap between 'G' and 'H')
                                                     ---  'I'
                                                           -------  'J'

谁能帮帮我?

请注意,与上例不同的是,A 中的两个片段实际上可以具有相同的 ID(但是,它们不可能重叠)。还要注意 A 不是 const 并且可以在操作期间修改。出于性能考虑,请注意向量 A 通常相对较短;长度在 1 到大约一百(或几百)段之间。值 leftright 通常非常大(在 0 到大约 1e9 的范围内)并且这些线段中只有少数会相交。通常,当段数很少时,这些段会很宽(当 size 为 1 时,单个段的宽度通常约为 1e9)。最后,你可以用

打印上图
void print(std::vector<Segment>& v)
{
    for (auto& elem : v)
    {
        std::cout << "{" << elem.left << ", " << elem.right << ", " << elem.ID << "} ";
    }
    std::cout << "\n";

    for (auto& elem : v)
    {
        for (size_t i = 0 ; i < elem.left ; ++i)
            std::cout << " ";
        for (size_t i = 0 ; i < elem.right - elem.left ; ++i)
            std::cout << "-";
        std::cout << "  " << elem.ID << "\n";           
    }
    std::cout << "\n\n\n";  
}

不需要对输入进行排序的算法会更好。

尝试

只是为了展示努力,这里尝试 1) 有问题,2) 实施起来相对较慢。我们称“断点”为以下 B 向量中左侧的任意右侧。这个想法是通过系统地在之前和之后的片段中搜索潜在的下一个断点来从一个断点跳到下一个断点。这样做时,它应该跟踪应该在新段中给出什么 ID(如果断点之间的距离与 A 中的至少一个段匹配)。

std::vector<Segment> foo(std::vector<Segment>& A)
{
    if (A.size() <= 1) return A;

    std::vector<Segment> B;
    B.reserve(A.size());
    
    size_t A_index = 0;
    size_t currentPos = A[A_index].left;
    while ( A_index < A.size())
    {
        auto nextPos = A[A_index].right;

        //std::cout << "currentPos = " << currentPos << "\n";
        //std::cout << "nextPos before search = " << nextPos << "\n";

        bool isIntersection = false;
        // Search in preceding Segments
        for (size_t i = A_index - 1 ; i < A.size() ; --i)
        {
            if (A[i].right > currentPos && A[i].right < nextPos )
            {
                nextPos = A[i].right;
                isIntersection = true;
                //std::cout << "Found " << nextPos << " in preceding segment\n";
            }
        }

        // Search in following Segments
        for (size_t i = A_index+1 ; i < A.size() ; ++i)
        {
            if ( A[i].left > currentPos && A[i].left < nextPos)
            {
                nextPos = A[i].left;
                //std::cout << "Found left of " << nextPos << " in following segment\n";
                break;
            }

            if ( A[i].right > currentPos &&  A[i].right < nextPos )
            {
                nextPos = A[i].right;
                isIntersection = true;
                //std::cout << "Found right of " << nextPos << " in following segment\n";
                break;
            }
        }

        // create new Segment
        if (!isIntersection)
        {
            B.push_back({currentPos, nextPos, A[A_index].ID});
        } else
        {
            B.push_back({currentPos, nextPos, 'X'});
        }
        if (nextPos == A[A_index].right)
        {
            ++A_index;
            nextPos = A[A_index].left;
        }
        currentPos = nextPos;
    }

    return B;
}


int main()
{
    std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}};
    print(A);
    auto B = foo(A);
    print(B);
}

以下不是最有效的,但它产生了预期的输出。策略如下:

  1. (其实不需要输入向量排序)
  2. 将所有片段分解成 1 个宽的片段。将它们存储在地图中,该地图将位置映射到该位置存在的 ID
  3. 根据重叠分配新 ID
  4. 再次将碎片粘在一起

为方便起见,我使用了

struct Overlap {
    std::vector<char> IDs;
    Overlap() {}
    void add(char id) {IDs.push_back(id);}
};

而我最后一步的实现需要在构造函数中删除 left<right 的要求。

int main() {
    std::vector<Segment> A = {{3, 10, 'A'}, {7, 22, 'B'}, {14, 17, 'C'} , {16, 19, 'D'}, {25, 31, 'E'}, {28, 32, 'F'}, {34, 37, 'G'}, {34, 37, 'H'}, {46, 49, 'I'}, {52, 59, 'J'}};   
    // dissect
    std::map<size_t,Overlap> over;
    for (const auto& s : A) {
        for (size_t i = s.left; i < s.right; ++i) {
            over[i].add(s.ID);
        }
    }
    // assign new segments 
    std::map<size_t,char> pieces;
    for (const auto& o : over) {
        if (o.second.IDs.size() == 1) {
            pieces[o.first] = o.second.IDs.front();
        } else {
            pieces[o.first] = 'X';
        }
    }
    // glue them
    std::vector<Segment> result;
    auto it = pieces.begin();
    Segment current(it->first,it->first,it->second);  // here left==right !
    ++it;
    for ( ; it != pieces.end(); ++it) {
        if (it->second == current.ID) continue;
        current.right = it->first -1;
        result.push_back(current);
        current = Segment{it->first,it->first,it->second};
    }

    print(result);
}

输出:

{3, 6, A} {7, 9, X} {10, 13, B} {14, 18, X} {19, 24, B} {25, 27, E} {28, 30, X} {31, 33, F} {34, 45, X} {46, 51, I} 
   ---  A
       --  X
          ---  B
              ----  X
                   -----  B
                         --  E
                            --  X
                               --  F
                                  -----------  X
                                              -----  I

Live Example

实际上唯一昂贵的部分是 1(O(段数 x 宽度)。我相信使用两个容器可以提高效率,一个根据 left 排序,一个根据 right,以便更容易检测重叠。但是,上面的天真实现是我要开始的。另外,如果宽度与段数相比相当小,那么它可能是 O(段数 x 宽度)排序优于 O(log 段数 x 段数)。

这里有一个解决方案,它计算由段创建的所有过渡点,然后使用这些点重建新段。

算法是:

  1. 每段产生2个过渡点,一个用于段的开始,一个用于段的结束。

  2. 转换点已排序。

  3. 从每一对相邻的过渡点构建新的线段。每对点代表:

    a) 一个空段(没有添加新的段)

    b) 单个段(添加.ID段)

    c) 多段(添加'X'段)

  4. 新建的段可能包含相邻的X段,因此需要合并。

首先,一个简单的结构来存储转换点:

struct Point
{
    size_t location; 
    bool overlap;    // does this point start/close a new segment
    char ID;
};

实现是:

std::vector<Segment> foo(std::vector<Segment> const & segments)
{
    // generate all transition points
    std::vector<Point> points;
    for (auto const & seg : segments)
    {
        points.push_back({seg.left, true, seg.ID});
        points.push_back({seg.right, false, seg.ID});
    }

    // sort transition points
    std::sort(points.begin(), points.end(), 
      [](auto a, auto b) { return a.location < b.location; });

    std::vector<Segment> res;

    // initialize overlaps
    std::multiset<char> overs{points[0].ID};

    // for every adjacent transition point
    for(auto i = 1u; i < points.size(); ++i) 
    {
        auto &a = points[i - 1];
        auto &b = points[i];

        // if there is a jump in between transition points
        if (a.location < b.location)
           switch (overs.size())
           {
               // no segment
               case 0 : break;
               // ony one segment
               case 1 : res.push_back({a.location, b.location, *overs.begin()}); break;
               // overlapping segment
               default : res.push_back({a.location, b.location, 'X'}); break;
           }

        // update overlaps
        if (b.overlap)
           overs.insert(b.ID);
        else
           overs.erase(overs.find(b.ID));  
    }
    
    // merge adjacent 'X' overlaps 
    for(auto i = 0u; i < res.size(); ++i) 
    {
         if (res[i].ID == 'X')
         {
           auto f = std::find_if(res.begin() + i + 1, res.end(),
            [](auto r) { return r.ID != 'X'; });
           res[i].right = (f - 1)->right;
           res.erase(res.begin() + i + 1, f); 
         }
     }
        
    return res;
}

这是一个 O(n log(n)) 算法。

这里是 demo