按词汇顺序生成大量字符串的紧凑方法

Question

我想生成具有以下属性的字符串序列：

按词法排序
理论上无限大
在实际范围内紧凑
通过简单的递增过程生成
匹配正则表达式 /\w+/

生成词法排序序列的明显方法是选择一个字符串长度并用这样的基值填充字符串：000000、000001 等。这种方法提出了一个排列数量和紧凑性之间的权衡：足够长以产生许多排列的字符串将沿途填充许多零。另外，我选择的长度设置了排列总数的上限，除非我有某种机制可以在字符串达到最大值时对其进行扩展。

所以我想出了一个这样的序列：

每个字符串由一个 "head" 组成，它是一个 base-36 数字，后跟一个下划线，然后是 "tail"，它也是一个 base-36 数字，由递增的数字填充零的数量
第一个循环从 0_0 到 0_z
第二个循环从1_00到1_zz
第三个循环从2_000到2_zzz，依此类推
一旦头部达到 z 并且尾部由 36 z 组成，第一个 "supercycle" 就结束了。现在整个序列重新开始，除了 z 仍然在开头，所以新的循环从 z0_0 开始，然后继续到 z1_00，依此类推
第二个超级周期zz0_0、zz1_00，依此类推

尽管头部的 z 字符串在较长的运行中可能会变得笨拙，但单个超级循环包含超过 10^56 个排列，这远远超过我预期使用的数量.该序列理论上是无限的，但在现实范围内非常紧凑。例如，万亿次排列是简洁的 7_bqd55h8s.

我可以用这个 javascript 函数相对简单地生成序列：

function genStr (n) {
  n = BigInt(n);
  let prefix = "",
      cycle = 0n,
      max = 36n ** (cycle + 1n);
  while (n >= max) {
    n -= max;
    if (cycle === 35n) {
      prefix += "z";
      cycle = 0n;
    } else {
      cycle++;
    }
    max = 36n ** (cycle + 1n);
  }
  return prefix
         + cycle.toString(36)
         + "_"
         + n.toString(36).padStart(Number(cycle) + 1, 0);
}

n 参数是一个数字，我将其递增并传递给函数以获取序列的下一个成员。我只需要跟踪一个简单的整数，使序列非常易于使用。

所以显然我在这上面花了很多时间，我认为它很好，但我想知道是否有更好的方法。是否有一种好的算法可以生成与我正在寻找的序列一致的序列？

Answer 1

与您的想法相近。（比我的第一次编辑更精炼...）。

让我们的字母表为 A = {0,1,2,3}。

设 |2| 表示我们从 0 迭代到 2，|2|^2 表示我们以词法排序的方式生成笛卡尔积 (00,01,10,11)。

我们从

开始

0 |3|

所以我们有一个长度为 2 的字符串。我们 "unshift" 数字 1 "factorizes" 因为任何 0|3|... 都小于 1|3|^2。

1 |3|^2

相同的想法：取消移位 2，并生成长度为 4 的单词。

2 |3|^3

现在我们可以继续生成

3 |2| |3|^3

注意 |2| 而不是 |3|。现在我们的最大数量变成 32333。和你一样，我们现在可以添加进位并开始一个新的超级循环：

33 0|3|

这是一个小改进，因为 _ 现在可以成为我们字母表的一部分：我们不需要将其保留为标记分隔符。

在我们的例子中，我们可以用超级循环表示：

n + n^2 + ... + n^(n-1) + (n-1) * n^(n-1)
\-----------------------/\--------------/
  geometric                special

在你的情况下，特殊部分将是 n^n（理论上你少了一个字符的细微差别，所以到处都用 n-1 替换 n）

提议的超级周期长度：

P = (n \sum_{k = 0}^{n-2} n^k) + (n-1) * n^(n-1) 
P = (n \sum_{k = 0}^{n-3} n^k) + n^n
P = n(n^{n-2} - 1)/(n-1) + n^n

这是字母表 A={0,1,2}

的示例差异

my  genStr(grandinero)
,00 0_0                 
,01 0_1                 
,02 0_2                 
,100 1_00               
,101 1_01               
,102 1_02               
,110 1_10               
,111 1_11               
,112 1_12               
,120 1_20               
,121 1_21               
,122 1_22               
,2000 2_000     
,2001 2_001     
,2002 2_002     
,2010 2_010     
,2011 2_011     
,2012 2_012     
,2020 2_020     
,2021 2_021     
,2022 2_022     
,2100 2_100     
,2101 2_101     
,2102 2_102     
,2110 2_110     
,2111 2_111     
,2112 2_112     
,2120 2_120     
,2121 2_121     
,2122 2_122     
22,00 2_200 <-- end of my supercycle if no '_' allowed
22,01 2_201     
22,02 2_202     
22,100 2_210     
22,101 2_211     
22,102 2_212     
22,110 2_220     
22,111 2_221     
22,112 2_222 <-- end of yours    
22,120 z0_0

也就是说，对于给定的数字 x，我们可以计算出有多少个超级循环 (E(x / P))，每个超级循环有两个前导 e (e 是 A 的最后一个字符）。例如：A = {0,1,2} 和 x = 43

e = 2
P = n(n^{n-2} - 1)/(n-1) + n^n = 3(3^1 -1)/2 + 27 = 30
// our supercycle is of length 30
E(43/30) = 1 // 43 makes one supercycle and a few more "strings"
r = x % P = 13 // this is also x - (E(43/30) * 30) (the rest of the euclidean division by P)

那么对于剩下的(r = x % P)两种情况要考虑：

要么我们落在几何序列中
要么我们属于(n-1) * n^(n-1)部分。

1。用累加和寻址几何序列(`x < S_w`)

设S_i为n, n^2,..

的总和

S_i = n\sum_{k = 0}^{i-1} n^k
S_i = n/(n-1)*(n^i - 1)

这给出 S_0 = 0, S_1 = n, S_2 = n + n^2...

所以基本上，如果 x < S_1，我们得到 0(x)，elif x < S_2，我们得到 1(x-S_1)

让S_w = S_{n-1}计算我们可以表示的所有数字。

如果 x <= S_w 那么我们希望 i 这样 S_i < x <= S_{i+1} <=> n^i < (n-1)/n * x + 1 <= n^{i+1}

然后我们可以应用一些原木地板 (base(n)) 来获得 i。

然后我们可以关联字符串：A[i] + base_n(x - S_i).

插图：

这次 A = {0,1,2,3}。

设x为17。

我们连续的S_i是：

S_0 = 0
S_1 = 4
S_2 = S_1 + 4^2 = 20
S_3 = S_2 + 4^3 = 84
S_w = S_{4-1} = S_3 = 84

x=17确实小于84，我们将可以将其影响到S_i范围之一。特别是 S_1==4 < x==17 <= S_2==20.

我们删除了由前导 0 编码的字符串（这些字符串有 S_1 个）。

编码前导1的位置是 x - 4 = 13.

我们得出结论，用前导 1 生成的 13 的字符串是 base_4(13) = '31'（同上字符串 -> '131'）

如果我们有 x = 21，我们会删除 S_2 的计数，所以 21-20 = 1，这反过来给出前导 2 的字符串 '2001'。

2。在特殊部分寻址 x (`x >= S_w`)

让我们考虑下面的研究案例：

和A = {0,1,2} 特殊部分是

2 |1| |2|^2 即：

第二列的每个递增数字（此处为 0 至 1（由 |1| 指定））给出 3^2 组合。

这类似于几何级数，只是这里每个范围都是常数。我们想找到范围，这意味着我们知道要为哪个字符串添加前缀。

我们可以将其表示为矩阵

20 (00,01,02,10,11,12,20,21,22)
21 (00,01,02,10,11,12,20,21,22)

括号中的部分是我们的矩阵。

一行中的每个项目只是它的位置 base_3（左填充 0）。

例如：n=7 具有 base_3 值“21”。 (7=2*3+1)。 “21”确实出现在行中的 7 位置。

假设我们得到一些x（相对于那个特殊部分）。

E(x / 3^2) 给我们行号（这里 E(7/9) = 0 所以前缀是 '20'）
x % 3^2 给我们在行中的位置（这里 base_3(7%9)='21' 给我们最后的字符串 '2021'）

如果我们想观察它，请记住我们之前减去S_w=12得到x = 7，所以我们会调用myGen(7+12)

一些代码

请注意，只要我们站在 "geometric" 范围内，没有超周期，输出是相同的。
很明显，什么时候进位开始出现，就看我能不能用'_'了。如果是，我的话会变短，否则会变长。

// https://www.cs.sfu.ca/~ggbaker/zju/math/int-alg.html
// \w insensitive could give base64
// but also éè and other accents...
function base_n(x, n, A) {
  const a = []
  while (x !== 0n) {
    a.push(A[Number(x % n)])
    x = x / n // auto floor with bigInt
  }
  return a.reverse().join('')
}

function mygen (A) {
  const n = A.length
  const bn = BigInt(n)

  const A_last = A[A.length-1]
  const S = Array(n).fill(0).map((x, i) => bn * (bn ** BigInt(i) - 1n) / (bn - 1n))
  const S_w = S[n-1]

  const w = S_w + (bn - 1n) * bn ** (bn - 1n)
  const w2 = bn ** (bn - 1n)
  const flog_bn = x => {
    // https://math.stackexchange.com/questions/1627914/smart-way-to-calculate-floorlogx
    let L = 0
    while (x >= bn) {
      L++
      x /= bn
    }
    return L
  }
  return function (x) {

    x = BigInt(x)
    let r = x % w
    const q = (x - r) / w
    let s
    if (r < S_w) {
      const i = flog_bn(r * (bn - 1n) / bn + 1n)
      const r2 = r - S[i]
      s = A[i] + base_n(r2, bn, A).padStart(i+1, '0')
    } else {
      const n2 = r - S_w
      const r2 = n2 % w2
      const q2 = (n2 - r2 ) / w2
      s = A_last + A[q2] + base_n(r2, bn, A).padStart(n-1, '0')
    }
    // comma below __not__ necessary, just to ease seeing cycles
    return A_last.repeat(2*Number(q)) +','+ s 
  }
}

function genStr (A) {
  A = A.filter(x => x !== '_')
  const bn_noUnderscore = BigInt(A.length)
  return function (x) {
    x = BigInt(x);
    let prefix = "",
        cycle = 0n,
        max = bn_noUnderscore ** (cycle + 1n);
    while (x >= max) {
      x -= max;
      if (cycle === bn_noUnderscore - 1n) {
        prefix += "z";
        cycle = 0n;
      } else {
        cycle++;
      }
      max = bn_noUnderscore ** (cycle + 1n);
    }
    return prefix
           + base_n(cycle, bn_noUnderscore, A)
           + "_"
           + base_n(x, bn_noUnderscore, A).padStart(Number(cycle) + 1, 0);    
  }
}
function test(a, b, x){
  console.log(a(x), b(x))
}
{
  console.log('---my supercycle is shorter if underscore not used. Plenty of room for grandinero')
  const A = '0123456789abcdefghijklmnopqrstuvwxyz'.split('').sort((a,b)=>a.localeCompare(b))
  let my = mygen(A)
  const grandinero = genStr(A)
  test(my, grandinero, 1e4)
  test(my, grandinero, 1e12)
  test(my, grandinero, 106471793335560744271846581685593263893929893610517909620n) // cycle ended for me (w variable value)
}
{
  console.log('---\n my supercycle is greater if underscore is used in my alphabet (not grandinero since "forbidden')
  // underscore used
  const A = '0123456789abcdefghijklmnopqrstuvwxyz_'.split('').sort((a,b)=>a.localeCompare(b))
  let my = mygen(A)
  const grandinero = genStr(A)
  test(my, grandinero, 1e12)
  test(my, grandinero, 106471793335560744271846581685593263893929893610517909620n) // cycle ended for me (w variable value)
  test(my, grandinero, 1e57) // still got some place in the supercycle
}

Answer 2

在考虑了@kaya3 和@grodzi 提供的建议并查看了我的原始代码后，我做了一些改进。我意识到一些事情：

我的原始代码中存在错误。如果一个循环在 z_z 结束（实际上是在下划线之后的 36 个 z，但你明白了）而下一个循环在 z0_0 开始，那么词法排序就被打破了，因为 _ 出现在 0。分隔符（或“颈部”）在词汇顺序上需要低于头部的最低可能值。
虽然我最初反对滚动自定义 baseN 生成器以便包含更多字符的想法，但我现在接受了这个想法。
我可以通过增加琴颈来从给定的弦长中挤出更多的排列。例如，我可以从 A00...A0z 到 A10...A1z，依此类推，从而在我继续 [=19] 之前增加以 A 作为头部可以生成的唯一字符串的数量=].

考虑到这一点，我修改了我的代码：

// this is the alphabet used in standard baseN conversions:

let baseAlpha = "0123456789abcdefghijklmnopqrstuvwxyz";

// this is a factory for creating a new string generator:

function sequenceGenerator (config) {
  let 
      // alphabets for the head, neck and body:

      headAlpha = config.headAlpha,
      neckAlpha = config.neckAlpha,
      bodyAlpha = config.bodyAlpha,

      // length of the body alphabet corresponds to the
      // base of the numbering system:

      base = BigInt(bodyAlpha.length),

      // if bodyAlpha is identical to an alphabet that
      // would be used for a standard baseN conversion,
      // then use the built-in method, which should be
      // much faster:

      convertBody = baseAlpha.startsWith(bodyAlpha)
                    ? (n) => n.toString(bodyAlpha.length)

      // otherwise, roll a custom baseN generator:

                    : function (n) {
                        let s = "";
                        while (n > 0n) {
                          let i = n % base;
                          s = bodyAlpha[i] + s;
                          n = n / base;
                        }
                        return s;
                      },

      // n is used to cache the last iteration and is
      // incremented each time you call `getNext`
      
      // it can optionally be initialized to a value other
      // than 0:

      n = BigInt(config.start || 0),

      // see below:

      headCycles = [0n],
      cycleLength = 0n;
      
  // the length of the body increases by 1 each time the
  // head increments, meaning that the total number of
  // permutations increases geometrically for each
  // character in headAlpha

  // here we cache the maximum number of permutations for
  // each length of the body
    
  // since we know these values ahead of time, calculating
  // them in advance saves time when we generate a new
  // string
    
  // more importantly, it saves us from having to do a
  // reverse calculation involving Math.log, which requires
  // converting BigInts to Numbers, which breaks the
  // program on larger numbers:

  for (let i = 0; i < headAlpha.length; i++) {

    // the maximum number of permutations depends on both
    // the string length (i + 1) and the number of
    // characters in neckAlpha, since the string length
    // remains the same while the neck increments

    cycleLength += BigInt(neckAlpha.length) * base ** BigInt(i + 1);
    headCycles.push(cycleLength);
  }

  // given a number n, this function searches through
  // headCycles to find where the total number of
  // permutations exceeds n
  
  // this is how we avoid the reverse calculation with
  // Math.log to determine which head cycle we are on for
  // a given permutation:

  function getHeadCycle (n) {
    for (let i = 0; i < headCycles.length; i++) {
      if (headCycles[i] > n) return i;
    }
  }

  return {

    cycleLength: cycleLength,

    getString: function (n) {
      let cyclesDone = Number(n / cycleLength),
          headLast = headAlpha[headAlpha.length - 1],
          prefix = headLast.repeat(cyclesDone),
          nn = n % cycleLength,
          headCycle = getHeadCycle(nn),
          head = headAlpha[headCycle - 1],
          nnn = nn - headCycles[headCycle - 1],
          neckCycleLength = BigInt(bodyAlpha.length) ** BigInt(headCycle),
          neckCycle = nnn / neckCycleLength,
          neck = neckAlpha[Number(neckCycle)],
          body = convertBody(nnn % neckCycleLength);
      body = body.padStart(headCycle , bodyAlpha[0]);
      return prefix + head + neck + body;
    },

    getNext: function () { return this.getString(n++); }

  };
}

let bodyAlpha = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz",
    getStr = sequenceGenerator({

      // achieve more permutations within a supercycle
      // with a larger headAlpha:

      headAlpha: "123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",

      // the highest value of neckAlpha must be lower than
      // the lowest value of headAlpha:

      neckAlpha: "0",
      bodyAlpha: bodyAlpha
    });

console.log("---supercycle length:");
console.log(Number(getStr.cycleLength));
console.log("---first two values:")
console.log(getStr.getNext());
console.log(getStr.getNext());
console.log("---arbitrary large value (1e57):");
console.log(getStr.getString(BigInt(1e57)));
console.log("");

// here we use a shorter headAlpha and longer neckAlpha
// to shorten the maximum length of the body, but this also
// decreases the number of permutations in the supercycle:

getStr = sequenceGenerator({
  headAlpha: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
  neckAlpha: "0123456789",
  bodyAlpha: bodyAlpha
});

console.log("---supercycle length:");
console.log(Number(getStr.cycleLength));
console.log("---first two values:");
console.log(getStr.getNext());
console.log(getStr.getNext());
console.log("---arbitrary large value (1e57):");
console.log(getStr.getString(BigInt(1e57)));

编辑

经过与@grodzi的进一步讨论，我做了一些改进：

我意识到“颈部”或分隔器没有提供太多价值，所以我已经摆脱了它。 稍后编辑：实际上，分隔符是必需的。我不确定为什么我认为不是。如果没有分隔符，每个新超循环的开始在词法上将先于前一个超循环的结束。我没有更改下面的代码，但任何使用此代码的人都应该包含一个分隔符。我也意识到使用下划线作为分隔符是错误的。分隔符必须是一个字符，例如连字符，它在词汇上位于序列 (0) 中使用的最低数字之前。
我采纳了@grodzi 的建议，允许尾巴的长度无限期地继续增长。

这是新代码：

let baseAlpha = "0123456789abcdefghijklmnopqrstuvwxyz";

function sequenceGenerator (config) {

  let headAlpha = config.headAlpha,
      tailAlpha = config.tailAlpha,
      base = BigInt(tailAlpha.length),
      convertTail = baseAlpha.startsWith(tailAlpha)
                    ? (n) => n.toString(tailAlpha.length)
                    : function (n) {
                        if (n === 0n) return "0";
                        let s = "";
                        while (n > 0n) {
                          let i = n % base;
                          s = tailAlpha[i] + s;
                          n = n / base;
                        }
                        return s;
                      },
      n = BigInt(config.start || 0);

  return {

    getString: function (n) {
      let cyclesDone = 0n,
          headCycle = 0n,
          initLength = 0n,
          accum = 0n;
      for (;; headCycle++) {
        let _accum = accum + base ** (headCycle + 1n + initLength);
        if (_accum > n) {
          n -= accum;
          break;
        } else if (Number(headCycle) === headAlpha.length - 1) {
          cyclesDone++;
          initLength += BigInt(headAlpha.length);
          headCycle = -1n;
        }
        accum = _accum;
      }
      let headLast = headAlpha[headAlpha.length - 1],
          prefix = headLast.repeat(Number(cyclesDone)),
          head = headAlpha[Number(headCycle)],
          tail = convertTail(n),
          tailLength = Number(headCycle + initLength);
      tail = tail.padStart(tailLength, tailAlpha[0]);
      return prefix + head + tail;
    },

    getNext: function () { return this.getString(n++); }

  };
}

let alpha = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz",
    genStr = sequenceGenerator({headAlpha: alpha, tailAlpha: alpha});

console.log("--- first string:");
console.log(genStr.getString(0n));
console.log("--- 1e+57");
console.log(genStr.getString(BigInt(1e+57)));
console.log("--- end of first supercycle:");
console.log(genStr.getString(63n*(1n-(63n**63n))/(1n-63n)-1n));
console.log("--- start of second supercycle:");
console.log(genStr.getString(63n*(1n-(63n**63n))/(1n-63n)));

按词汇顺序生成大量字符串的紧凑方法

Compact way to produce a large sequence of strings in lexical order

algorithm

sequence

1。用累加和寻址几何序列(`x < S_w`)

2。在特殊部分寻址 x (`x >= S_w`)

一些代码

按词汇顺序生成大量字符串的紧凑方法

Compact way to produce a large sequence of strings in lexical order

algorithm

sequence

1。用累加和寻址几何序列(x < S_w)

2。在特殊部分寻址 x (x >= S_w)

一些代码

1。用累加和寻址几何序列(`x < S_w`)

2。在特殊部分寻址 x (`x >= S_w`)