Burrows-Wheeler 变换 (BWT) 重复字符串

Burrows-Wheeler Transform (BWT) repeating string

我正在 Python 中编写 Burrows-Wheeler 变换及其反转。它适用于小弦,但当我测试更大的弦时它就崩溃了。在某些时候,字符串似乎会循环。我确定它一定与解码器的最后一个循环有关,但我正在按照在多个网站上找到的步骤进行操作。我的实现如下:

class BurrowsWheelerTransform:

    def __init__(self, data):
        self.data = data

    def transform(self):
        # get data size
        size = len(self.data)
        # get order (by index) of rotations
        order = sorted(range(size), key=lambda i: self.data[i:])
        # get index of original rotation
        index = order.index(0)
        # return size appended with last column of (imaginary) rotation table
        return chr(255) * (index // 255) + chr(index % 255) + ''.join(self.data[(i - 1 + size) % size] for i in order)

    def restore(self):
        # get index of end of index
        eoi = next(i for i in range(len(self.data)) if ord(self.data[i]) < 255)
        # get index
        index = 255 * eoi + ord(self.data[eoi])
        # get tranformed content
        content = self.data[eoi + 1:]
        # get lshift array
        lshift = [i - 1 for symbol in sorted(set(content)) for i, x in enumerate(self.data) if x == symbol]
        # restore
        restored = ''
        for i in range(len(content)):
            index = lshift[index]
            restored += content[index]
        # return restored
        return restored

原始字符串:

Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When her carriage drove out of the house, he mounted and accompanied her eight miles from Boguchrovo to where the road was occupied by our troops. At the inn at Yankvo he respectfully took leave of her, for the first time permitting himself to kiss her hand.

How can you speak so! he blushingly replied to Princess Marys expressions of gratitude for her deliverance, as she termed what had occurred. Any police officer would have done as much! If we had had only peasants to fight, we should not have let the enemy come so far, said he with a sense of shame and wishing to change the subject. I am only happy to have had the opportunity of making your acquaintance. Good-by, Princess. I wish you happiness and consolation and hope to meet you again in happier circumstances. If you dont want to make me blush, please dont thank me!

解码后的字符串:

Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When her carriage drove out of the house, he mounted and accompanied her eight miles from Boguchrovo to where the road was occupied by our troops. At the inn at Yankvo he respectfully took leave of her, for the first time permitting himself to kiss her hand.

How can you speak so!Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When her carriage drove out of the house, he mounted and accompanied her eight miles from Boguchrovo to where the road was occupied by our troops. At the inn at Yankvo he respectfully took leave of her, for the first time permitting himself to kiss her hand.

How can you speak so!Unwilling to obtrude himself on the princess, Rostv did not go back to the house but remained in the village awaiting her departure. When

奇怪的是,我在网上找到并测试过的其他实现似乎也会发生这种情况,比如 this one and this one。到底是怎么回事?我是否误解了转换的工作原理?或者这个实现不正确?

找到答案了!该算法基于以下假设:字符串连接到一个最终字符,该字符是唯一的,并且在字典序上小于字符串中的任何其他字符。但是,由于我打算对任何给定字符使用 0-255 范围内的任何值,因此我无法使用额外的符号。幸运的是,thanks to John Kurlak 和一些额外的错误修复,我已经能够稍微修改我的初始实现以解决这个问题。这是:

class BurrowsWheelerTransform:

    def __init__(self, data):
        self.data = data

    def transform(self):
        # get data size
        size = len(self.data)
        # get doubled string
        self.data *= 2
        # get order (by index) of rotations
        order = sorted(range(size), key=lambda i: self.data[i:])
        # get index of original rotation
        index = order.index(0)
        # return index appended with last column of (imaginary) rotation table
        return chr(255) * (index // 255) + chr(index % 255) + ''.join(self.data[(i - 1 + size) % size] for i in order)

    def restore(self):
        # get index of end of index
        eoi = next(i for i in range(len(self.data)) if ord(self.data[i]) < 255)
        # get index
        index = 255 * eoi + ord(self.data[eoi])
        # get tranformed content
        content = self.data[eoi + 1:]
        size = len(content)
        # get lshift array
        lshift = [i for symbol in sorted(set(content)) for i, x in enumerate(content) if x == symbol]
        # restore
        restored = ''
        for i in range(size):
            index = lshift[index]
            if index >= size: break
            restored += content[index]
        # return restored
        return restored

谢谢,约翰!

我在 C++ 中遇到了与 BWT 排序算法相同的问题,感谢您的评论,我通过使用 16 位数组而不是 8 位数组快速修复了它,以允许最后一个字符的值高于 0xFF (255) (还必须稍微修改 BWT 排序算法代码以使其接受 16 位数据)。现在可以正常使用了,谢谢。