Print / Find the first character (with the LOWEST occurrence) in a non-empty string and order is important
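BASH: GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
Given a string str that can hold any lowercase letters, uppercase letters, or digits:
How do I find the first character (the one with the lowest occurrence count) in a non-empty string str? The main focus of the question is to print the letter 'z' if the script is like this (as fast as possible, and without any errors whether the data is in a string or in a file): https://repl.it/@asangal/find1stleastoccurrencecharmaintainorderanyleastsize
Sample values of str: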
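str=aa, output should be 'a' (as 'a' is the only character in the string, occurring 2 times)
str=aa1, output should be '1' (as '1' is the first character with the lowest occurrence count)
str=aa1c1deef, output should be 'c' (as 'c' comes before 'd' and both have the lowest occurrence count of 1)
str=abcdeeddAbac, output should be 'A' (as 'A' is the first character with the lowest occurrence count of 1)
str=abcdeeddAbacA, output should be 'a' (as 'a' is the first character with the lowest occurrence count of 2)
str=abcdeeddAbacAabc, output should be 'e' (as 'e' is the first character with the lowest occurrence count of 2)
Other, larger sample values can be:
str=axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh, output should be 'g' (as 'g' is the first character with the lowest occurrence count)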
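Constraints/context:
- The value can be lowercase, UPPERCASE, or a number.
- The string is always non-empty; we can ignore any space-type characters in the value for now.
- Find the first letter ([a-zA-Z0-9]) that appears in the str string with the lowest occurrence count.
- If possible, I don't want to use any statements (e.g. if-then-else), loops (for/while), or user-defined functions. Using commands and library functions (if available to the user out of the box) is OK.
PS: I know that system-level commands do call all of these things behind the scenes, but I'm looking for minimal code, if possible at the command line, i.e. the $ prompt.
I tried the following ugly non-one-liner attempt; it has a for loop, which I would like to avoid if possible, and the sort command helps but also makes me lose the order and does not cover all conditions.
I don't like my current attempt listed below, but it looks like I'm close.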
str="axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh";
for char in $(echo $str | sed "s/\(.\)/\1\n/g" | grep .| tr '\012' ' ');
do
echo -n "$char=$(echo ${str} | sed "s/\(.\)/\1\n/g" | grep . | grep -c $char)";echo;
done | sort -u
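I believe it's possible to achieve the one-liner I'm looking for (i.e. by using a bunch of common Linux commands and pipes |) in BASH; just wanted to pick your brains! I know there are better shell experts out there than me.
Most of the solutions I found online don't maintain the order (which is important to me) and just give the highest/lowest occurrence count of a character.
EDIT2: In case someone needs to know the first-occurring character/integer etc. with the minimum count in Input_file, then try the following.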
awk '
{
num=split($0,array,"")
for(i=1;i<=num;i++){
++count[array[i]]
}
for(j=1;j<=num;j++){
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
}
for(i in count){
min=min<=count[i]?(min?min:count[i]):count[i]
}
}
END{
print "Minimum value found is:" min
split(tot_ind[min],actual," ")
print "All item(s) with same minimum values are:" actual[1]
}
' Input_file
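EDIT: Since the OP was getting an error when reading from a variable, let's read from Input_file instead; if the OP is reading the values from Input_file, then try the following.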
awk '
{
delete tot_ind
delete array
delete count
delete actual
min=""
num=split($0,array,"")
for(i=1;i<=num;i++){
++count[array[i]]
}
for(j=1;j<=num;j++){
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
}
for(i in count){
min=min<=count[i]?(min?min:count[i]):count[i]
}
print "Minimum value found is:" min
split(tot_ind[min],actual," ")
print "All item(s) with same minimum values are:" actual[1]
}' Input_file
Explanation: Adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
{
num=split($0,array,"") ##Splitting current line into array with NULL delimiter.
for(i=1;i<=num;i++){ ##Running loop to run till num here.
++count[array[i]] ##Creating count array with index of value of array and keep incrementing its value with 1.
}
for(j=1;j<=num;j++){ ##Running for loop till num here.
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j] ##Creating tot_ind with index of value of count array, this will have all values of minimum number here.
}
for(i in count){ ##Traversing in array count here.
min=min<=count[i]?(min?min:count[i]):count[i] ##Looking to get minimum value by comparing its value to each element.
}
print "Minimum value found is:" min ##Printing Minimum value here.
split(tot_ind[min],actual," ") ##Splitting tot_ind into actual array to get very first element of minimum value out of all values which have same minimum number.
print "All item(s) with same minimum values are:" actual[1] ##Printing very first minimum number here.
}' Input_file ##Mentioning Input_file name here.
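To get the very first minimum value occurring in Input_file (by the way, with this solution one could also print all items that share the same minimum count, with a small change to the last print statement in this code). Written and tested in GNU awk.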
str="abcdeeddAbacA"
awk -v str="$str" '
BEGIN{
num=split(str,array,"")
for(i=1;i<=num;i++){
++count[array[i]]
}
for(j=1;j<=num;j++){
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
}
for(i in count){
min=min<=count[i]?(min?min:count[i]):count[i]
}
print "Minimum value found is:" min
split(tot_ind[min],actual," ")
print "All item(s) with same minimum values are:" actual[1]
}'
Proof of concept: running the above with the OP's examples.
./script.ksh aa
Minimum value found is:2
All item(s) with same minimum values are:a
./script.ksh aa1
Minimum value found is:1
All item(s) with same minimum values are:1
./script.ksh aa1c1deef
Minimum value found is:1
All item(s) with same minimum values are:c
./script.ksh abcdeeddAbac
Minimum value found is:1
All item(s) with same minimum values are:A
./script.ksh abcdeeddAbacA
Minimum value found is:2
All item(s) with same minimum values are:a
./script.ksh abcdeeddAbacAabc
Minimum value found is:2
All item(s) with same minimum values are:e
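NOTE: I saved the above solution in a script file and passed the OP's sample inputs as arguments to the script; the OP can use it in whichever way he wants, this was done only to show how it works.
Try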
grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
Demo:
$ str=aa
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
a
$ str=aa1
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
1
$ str=aa1c1deef
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
c
$ str=abcdeeddAbac
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
A
$ str=abcdeeddAbacA
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
e
$ str=abcdeeddAbacAabc
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
e
$ str=axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkf
$ grep -o . <<< ${str} | cat -n | sort -k2 | uniq -c -f1 | sort -nr -k1 -k2 | sed 's/.*[ \t]//g;$!d'
g
$
Edit: for the example below, the pipeline above prints 'e' instead of the expected 'a', because ee comes before a:
str=abcdeeddAbacA, output should be 'a' (as 'a' is the first character with lower occurrence count of 2)
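One possible way to make the pipeline keep first-occurrence order is sketched below. It is only a sketch under assumptions: it presumes grep -o, cat -n, sort and uniq behave as in GNU or BSD userlands, it forces LC_ALL=C on the assumption that locale-dependent collation can disturb the grouping step, and the function name first_least is a hypothetical wrapper used here purely so the pipeline can be called repeatedly. The idea is to sort by character with the position as an explicit numeric tie-breaker, so that uniq keeps the earliest position of each character, and then to sort by count first and position second.

```shell
# Hypothetical wrapper (not part of the original answer).
first_least() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -o '[[:alnum:]]' |   # one alphanumeric character per line
    cat -n |                           # prefix each character with its position
    LC_ALL=C sort -k2,2 -k1,1n |       # group equal characters, lowest position first
    LC_ALL=C uniq -c -f1 |             # count each group, keeping its first line
    sort -k1,1n -k2,2n |               # lowest count first, then lowest position
    head -n 1 |
    cut -f2                            # cat -n puts a tab between number and text
}

first_least abcdeeddAbacA    # prints: a
```

The body of the function is a single pipeline, so it can also be pasted at the $ prompt with "$str" in place of "$1".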
Answer #1 - string/variable-based solution
Assuming the desired string is stored in the variable str, here is an awk solution:
awk -v str="${str}" '
BEGIN { num = split(str,token,"") # split str into an array of single letter/number elements
for ( i=1; i<=num; i++ ) { # get a count of occurrences of each letter/number
count[token[i]]++
}
min = 10000000
for ( i in count ) {
min = count[i]<min?count[i]:min # keep track of the lowest/minimum count
}
for ( i=1; i<=num; i++ ) { # loop through array of letter/numbers
if ( min == count[token[i]] ) { # for the first letter/number we find where count = min
print token[i], min # print the letter/number and count and
break # then break out of our loop
}
}
}'
Running the above against the different sample strings:
++++++++++++++++ str = aa
a 2
++++++++++++++++ str = aa1
1 1
++++++++++++++++ str = aa1c1deef
c 1
++++++++++++++++ str = abcdeeddAbac
A 1
++++++++++++++++ str = abcdeeddAbacA
a 2
++++++++++++++++ str = abcdeeddAbacAabc
e 2
++++++++++++++++ str = axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
g 1
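Answer #2 - file/array-based solution
Going by the OP's comment on RavinderSingh13's answer: a very large string resides in a file, and assuming the file is named giga.txt ...
We should be able to make a few small changes to the previous awk solution, e.g.: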
awk '
BEGIN { RS = "\0" } # address files with no cr/lf
{ num = split($0,token,"") # split line/$0 into an array of single letter/number elements
for( i=1; i<=num; i++ ) { # get a count of occurrences of each letter/number
all[NR i] = token[i] # token array is for current line/$0 while all array is for entire file
count[token[i]]++
}
}
END { min = 10000000
for ( i in count ) {
min = count[i]<min?count[i]:min # find the lowest/minimum count
}
for ( i in all ) { # loop through array of letter/numbers
if ( min == count[all[i]] ) { # for the first letter/number we find where count = min
print all[i], min # print the letter/number and count and
break # then break out of our loop
}
}
}
' giga.txt
Placing the longer str sample into giga.txt:
$ cat giga.txt
axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
Running the above awk solution against giga.txt gives us:
$ awk '....' giga.txt
g 1
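Answer #3 - file/substr()-based solution
The OP provided more details on how to generate a 'large' data file:
$ ls -lR / > giga.txt # I hit ^C after ~20 secs
$ sed "s/\(.\)/\1\n/g" giga.txt | grep -o [a-zA-Z0-9] | tr -d '\012' > newgiga.txt # remove all but letters and numbers
This gave me a 14-million-character file (newgiga.txt).
I ran a few timing tests, along with a new awk solution (see below), against the 14-million-character file and came up with the following timings:
- 15 secs with the file/array-based awk solution (see my previous answer, above)
- 25 secs with the OP's sed/grep/echo/uniq/tr/sort answer
- 4+ minutes with RavinderSingh13's awk solution (actually hit ^C after 4 minutes)
- 6 secs with the new file/substr()-based awk solution (see below)
NOTE: for all solutions that ran to completion against my particular newgiga.txt file, the final answer was the letter Z (occurring 365 times).
By replacing the split/array code with a series of substr() calls, and making a small change to how the all array is indexed, I was able to cut ~60% off the run time of the previous file/array-based awk solution: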
awk '
BEGIN { RS = "\0" }
{ len=length($0)
for( i=1; i<=len; i++ ) { # get a count of occurrences of each letter/number
token=substr($0,i,1)
a++
all[a] = token # token array is for current line/$0 while all array is for entire file
count[token]++
}
}
END { min=10000000
for( i in count ) {
min = count[i]<min?count[i]:min # find the lowest/minimum count
}
for( i in all ) { # loop through array of letter/numbers
if ( min == count[all[i]] ) { # for the first letter/number we find where count = min
print all[i], min # print the letter/number and count and
break # break out of our loop
}
}
}
' newgiga.txt
NOTE: honestly, I wasn't expecting the substr() calls to be faster than the split/array approach, but I'm guessing awk has a pretty fast built-in method for running substr() calls.
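One caveat worth keeping in mind with the two file-based programs above: awk does not define the order in which for ( i in all ) visits array elements, so the first match found that way is not guaranteed to be the earliest character in the file. A minimal sketch of an order-safe variant follows, assuming any POSIX awk; the function name first_least_file and the counter n are hypothetical names introduced for illustration. It walks the all array with a numeric index instead of for-in, and it still stores every character, so it does not reduce memory use.

```shell
# Hypothetical helper; reads the file named in $1.
first_least_file() {
  awk '
    { len = length($0)
      for ( i=1; i<=len; i++ ) {
          token = substr($0,i,1)
          all[++n] = token          # n is a running position across all lines
          count[token]++
      }
    }
    END { min = 0
          for ( c in count )        # order does not matter when only taking a minimum
              if ( min == 0 || count[c] < min ) min = count[c]
          for ( i=1; i<=n; i++ )    # numeric loop: visits characters in file order
              if ( count[all[i]] == min ) { print all[i], min; exit }
    }
  ' "$1"
}

tmp=$(mktemp)
printf '%s\n' abcdeeddAbacA > "$tmp"
first_least_file "$tmp"     # prints: a 2
rm -f "$tmp"
```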
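OK, I think I finally got it (after 3 bowls of carrot pudding since 5 AM); I got motivated!! by you all.
- No for loops or if-then conditions used.
- No variables created on the fly.
- Use time before the following solution; it shows that it finished in under 1.5 seconds (real 0m1.428s) on the largest file I have, whereas the awk solution using a file took about 4.5 seconds.
- Looks more like a one-liner (using only Linux commands and | pipes).
Any comments are welcome (in case I missed any use case).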
$ echo $str | egrep -o . | \
egrep $(echo $str | grep -o [a-zA-Z0-9] | sort | uniq -c | \
grep " $(echo $str | grep -o [a-zA-Z0-9] | sort | uniq -c| sort -n -k1 | head -1 | grep -ow " [0-9][0-9]*") " | \
sed "s/^[ \t][ \t]*//"|cut -d' ' -f2 | tr '\012' '|' | sed "s/.$//") | head -1
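It will output only alphanumeric characters (alnum).
If someone wants to see the count (a bit out of scope), they can change -f2 to -f1 in the cut command above.
Here's the script: https://repl.it/@asangal/find1stleastoccurrencecharmaintainorderanyleastsize
Answer #4 - file/substr()/reduced-array-usage solution
After some back and forth with @AKS and working with ever larger data sets (the latest test used a 36 MB file), awk/array memory issues popped up (e.g. for the larger data sets, the various awk answers so far required 6-8 GB of RAM).
My first attempt at addressing the memory issue is to copy all of the input into a new variable; yes, this means copying 36 MB of data into an awk variable, but that is still far less than 6-8 GB of RAM.
Using the new (larger) data set provided by @AKS: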
$ str="upvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDO
IQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauv
paupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLPlakjfaldsfpuFHAOOJJADFIASYDOYsggdhuafaismxasidfuasudfoasdufoiasudfoiayioOISYDOIQYORIYOIRYOIYQNOIYFAclamscvjlaivniauppruporupourpoupupovupuadouaouuouaudfaodfpadufuudofupuaspfupipoporqwooPOFPUnmcupauvpaupvouapouqweruuUPOUADUFUAUASDFLKHLP"
$ for i in {1..10}; do str="${str}${str}"; done
$ for i in {1..3}; do str="${str}${str}"; done
$ echo -e "\n\n-- Adding 'z' the only char in this big string blob 'str' variable'\n"
$ str="${str}z"
$ echo $str | wc
1 1 36864002
$ echo "${str}" > newgiga.txt
$ ls -lh newgiga.txt
-rw-r--r--+ 1 xxxxx yyyyy 36M Jun 6 16:55 newgiga.txt
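NOTE: given the way this data was created, every character occurs many times except for the letter z (which occurs just once, at the very end of the data set).
And the new/improved awk solution: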
$ time awk '
{ copy = copy $0 # make a copy of our input for later reparsing
len = length($0)
for ( i=1; i<=len; i++ ) { # get a count of occurrences of each letter/number
token = substr($0,i,1)
count[token]++
}
}
END { for ( i in count ) {
if ( min <= 0 )
min = count[i]
else
min = count[i]<min?count[i]:min # find the lowest/minimum count
}
total = length(copy); for ( i=1; i<=total; i++ ) { # reparse the whole copy (not just the last line) looking for first letter with count == min
token = substr(copy,i,1)
if ( min == count[token] ) {
print token, min # print the letter/number and count and
break # break out of our loop
}
}
}
' newgiga.txt
z 1 # as mentioned in the above NOTE => z occurs just once in the dataset
real 0m19.575s # a comparable, slightly slower rate per MB than the previous answer #3, which took 6 secs for 14 MB of data
user 0m19.406s
sys 0m0.171s
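NOTE: this answer used ~160 MB of memory on my machine (much better than the 6-8 GB of the previous answers) while running at about the same speed as before.
I also tried a solution that eliminated the copy variable and processed the input file a second time. Results:
- total memory usage dropped by ~30 MB (to ~130 MB)
- total run time increased by ~2 seconds
So the trade-off wasn't worth the effort.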
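The variant without the copy variable is not shown in the answer. One way such a two-pass approach might look is sketched below; this is an assumption about the general technique, not the author's actual code. It presumes the input is a regular, non-empty file that can be read twice, and the names first_least_twopass and have_min are hypothetical. The file name is passed to awk twice: while NR == FNR (the first pass) the characters are only counted, and on the second pass the first character whose count equals the minimum is printed.

```shell
# Hypothetical helper; $1 must be a regular file, because it is read twice.
first_least_twopass() {
  awk '
    NR == FNR {                          # first pass: count every character
        len = length($0)
        for ( i=1; i<=len; i++ ) count[substr($0,i,1)]++
        next
    }
    !have_min {                          # start of second pass: find the minimum once
        for ( c in count )
            if ( min == 0 || count[c] < min ) min = count[c]
        have_min = 1
    }
    {                                    # second pass: first character with count == min
        len = length($0)
        for ( i=1; i<=len; i++ ) {
            token = substr($0,i,1)
            if ( count[token] == min ) { print token, min; exit }
        }
    }
  ' "$1" "$1"
}

tmp=$(mktemp)
printf '%s\n' abcdeeddAbacA > "$tmp"
first_least_twopass "$tmp"    # prints: a 2
rm -f "$tmp"
```

Only the count array is held in memory here, at the price of reading the file a second time.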
If the file fits in memory:
<file tr -dc '[:alnum:]' | perl -ln0777e 'while (($c=substr $_,0,1) ne q{}) {$n=eval "y/\Q$c\E//d"; $count{$n}=$count{$n}.$c} END{for (sort {$a <=> $b} keys %count) {print substr $count{$_},0,1; exit}}'