检查 bash 中的索引数组是稀疏的还是密集的

Question

我在 bash 中有一个动态生成的索引数组，想知道它是 sparse 还是 dense。
如果在最后一个条目之前有未设置的索引，则数组是稀疏的。否则数组是密集的。

检查应该适用于所有情况，即使是空数组、非常大的数组（扩展时超过 ARG_MAX），当然还有具有任意条目的数组（例如空条目或包含 [=13 的条目） =]、\、空格和换行符）。后者应该相当简单，因为您可能无论如何都不想扩展数组的值。

理想情况下，支票应该高效且便携。

这里有一些基本的测试用例来检查你的解决方案。您的检查可以使用硬编码的全局变量名称 a 以与旧的 bash 版本兼容。对于 bash 4.3 及更高版本，您可能需要使用 local -n isDense_array="" 来代替，以便您可以指定要检查的数组。

isDense() {
  # INSERT YOUR CHECK HERE
  # if array `a` is dense, return 0 (success)
  # if array `a` is sparse, return any of 1-255 (failure) 
}
test() {
  isDense && result=dense || result=sparse 
  [[ "$result" = "$expected" ]] ||
  echo "Test in line $BASH_LINENO failed: $expected array considered $result"
}

expected=dense
a=(); test
a=(''); test
a=(x x x); test

expected=sparse
a=([1]=x); test
a=([1]=); test
a=([0]=x [2]=x); test
a=([4]=x [5]=x [6]=x); test
a=([0]=x [3]=x [4]=x [13]=x); test

要对您的支票进行基准测试，您可以使用

a=($(seq 9999999))
time {
  isDense
  unset 'a[0]'; isDense
  a[0]=1; unset 'a[9999998]'; isDense
  a=([0]=x [9999999999]=x); isDense
}

Answer 1

方法

非空、密集数组的索引从 0 到 ${#a[*]}-1。由于鸽巢原理，稀疏数组的last索引必须大于等于${#a[@]}.

Bash 脚本

为了获得最后一个索引，我们假设索引列表 ${!a[@]} 是按升序排列的。 Bash 的手册没有指定任何顺序，但是（至少对于 bash 5 及以下版本）实现保证了这个顺序（在源代码文件 array.c 中搜索 array_keys_to_word_list).

isDense() {
  [[ "${#a[*]}" = 0 || " ${!a[*]}" == *" $((${#a[*]}-1))" ]]
}

对于小型阵列，这非常有效。对于巨大的数组，由于 ${!a[*]}，检查有点慢。该问题的基准耗时 9.8 秒。

可加载Bash内置

这个答案中的方法只需要 last 索引。但是 bash 只允许使用 ${!a[*]} 提取 all 索引，这是不必要的慢。在内部，bash 知道最后一个索引是什么。因此，如果你愿意，你可以编写一个可加载的内置函数来访问 bash 的内部数据结构。

当然这不是一个真正实用的解决方案。如果性能真的那么重要，您不应该使用 bash 脚本。尽管如此，我写了这样一个内置函数是为了好玩。

Loadable bash builtin

上述内置函数的space和时间复杂度与数组的大小和结构无关。检查 isdense a 应该和 b=1.

一样快

Answer 2

更新：（重新）运行在 Ubuntu 20 VM 中进行测试（比以前在 cygwin/bash 环境中进行的测试要好得多; 时间更接近 Socowi 报告的时间)

注意： 我用 1000 万个条目填充了数组 ({0..9999999})。

使用与 Socowi 相同的假设...

To get the last index of an array, we assume that the list of indices ${!a[@]} is in ascending order. Bash's manual does not specify any order, but (at least for bash 5 and below) the implementation guarantees this order

...我们可以对typeset -p a的输出做出相同的假设，即输出按索引排序。

查找最后一个索引：

$ a[9999999]='[abcdef]'
$ typeset -p a | tail -c 40
9999998]="9999998" [9999999]="[abcdef]")

第一次尝试 使用 awk 剥离最后一个索引：

$ typeset -p a | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}'
9999999

这出奇地慢（> 3.5 分钟对于 cygwin/bash）而运行在 100%（单个）cpu 利用率和进食最多 ~4 GB 内存。

第二次尝试 使用 tail -c 限制馈送到 awk 的数据：

$ typeset -p a | tail -c 40 | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}'
9999999

它以约 1.6 秒 (ubuntu/bash) 的（相对）惊人速度出现。

第三次尝试 在使用 awk

解析之前将最后一个数组值保存在局部变量中

备注：

正如 Socowi 所指出的（在评论中），最后一个元素的内容可能包含多个字符（空格、方括号、single/double 引号等），这使得解析非常复杂
一个解决方法是将最后一个数组值保存在变量中，解析 typeset -p 输出，然后将值放回
可以通过 array[-1] 访问最后一个数组值（需要 bash 4.3+）

这也以约 1.6 秒的（相对）超快速度出现（ubuntu/bash）：

$ lastv="${a[-1]}"
$ a[-1]=""
$ typeset -p a | tail -c 40 |  awk -F'[][]' '{print $(NF-1)}'
9999999
$ a[-1]="${lastv}"

函数:

这为我们提供了以下 isDense() 函数：

初始函数

unset -f isDense

isDense() {
  local last=$(typeset -p a | tail -c 40 | awk '{delim="[][]"; split($NF,arr,delim); print arr[2]}')
  [[ "${#a[@]}" = 0 || "${#a[@]}" -ne "${last}" ]]
}

最新函数（变量=最后一个数组值）

unset -f isDense

isDense() {
  local lastv="${a[-1]}"
  a[-1]=""
  local last=$(typeset -p a | tail -c 40 | awk -F'[][]' '{print $(NF-1)}')
  [[ "${#a[@]}" = 0 || "${#a[@]}" -ne "${last}" ]]
  rc=$?
  a[-1]="${lastv}"
  return "${rc}"
}

基准：

从问题（减去第 4 次测试 - 我厌倦了等待重新填充我的 1000 万条目数组）这意味着......
记住 benchmark/test 调用 isDense() 3 次
运行每个功能在 ubuntu/bash 中运行几次，这些是最好的时间...

Socowi 的函数：

real    0m11.717s
user    0m9.486s
sys     0m1.982s

oguz ismail 的功能：

real    0m10.450s
user    0m9.899s
sys     0m0.546s

我的初始 typeset|tail-c|awk 函数：

real    0m4.514s
user    0m3.574s
sys     0m1.442s

使用 Socowi 的 declare|tr|tail-n 函数的最新测试（变量=最后一个数组值）：

real    0m5.306s
user    0m4.130s
sys     0m2.670s

使用原始 typeset|tail-c|awk 函数的最新测试（变量=最后一个数组值）：

real    0m4.305s
user    0m3.247s
sys     0m1.761s

Answer 3

isDense() {
  local -rn array=""
  ((${#array[@]})) || return 0
  local -ai indices=("${!array[@]}")
  ((indices[-1] + 1 == ${#array[@]}))
}

要遵循独立的调用约定：

test() {
  local result
  isDense "" && result=dense || result=sparse 
  [[ "$result" = "" ]] ||
  echo "Test in line $BASH_LINENO failed:  array considered $result"
}

a=(); test a dense
a=(''); test a dense
a=(x x x); test a dense

a=([1]=x); test a sparse
a=([1]=); test a sparse
a=([0]=x [2]=x); test a sparse
a=([4]=x [5]=x [6]=x); test a sparse
a=([0]=x [3]=x [4]=x [13]=x); test a sparse

检查 bash 中的索引数组是稀疏的还是密集的

Check if an indexed array in bash is sparse or dense

arrays

bash

sparse-matrix

方法

Bash 脚本

可加载Bash内置