How to access the prefix when using uniq -c

Question

I have a problem with my program. I have a list of files, and I sort them with this code to find the 10 most common file types in the list.

find $DIR -type f | file -b $SAVEFILES | cut -c1-40 | sort -n | uniq -c | sort -nr | head -10

My output looks like this:

    168 HTML document, ASCII text
    114 C source, ASCII text
    102 ASCII text
     33 ASCII text, with very long lines
     30 HTML document, UTF-8 Unicode text, with 
     26 HTML document, ASCII text, with very lon
     21 C source, UTF-8 Unicode text
     20 LaTeX document, UTF-8 Unicode text, with
     15 SVG Scalable Vector Graphics image
     12 LaTeX document, ASCII text, with very lo

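What I want to do is access the values in front of the file types and replace them with # characters. I could do that with a for loop, but first I need some way to access them.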

The expected output looks like this:

   __HTML document, ASCII text               : ################
   __C source, ASCII text                    : ###########
   __ASCII text                              : ##########
   __ASCII text, with very long lines        : ########
   __HTML document, UTF-8 Unicode text, with : #######
   __HTML document, ASCII text, with very lon: ####
   __C source, UTF-8 Unicode text            : #### 
   __LaTeX document, UTF-8 Unicode text, with: ###
   __SVG Scalable Vector Graphics image      : #
   __LaTeX document, ASCII text, with very lo: #

Edit: the # characters in my example do not represent the exact counts. The first line should have 168 #, the second 114 #, and so on.

Answer 1

Append:

| while read -r n text; do printf "__%s%$((48-${#text}))s: " "$text"; for ((i=0;i<$n;i++)); do printf "%s" "#"; done; echo; done

Change 48 as needed.

Output:

__HTML document, ASCII text                       : ########################################################################################################################################################################
__C source, ASCII text                            : ##################################################################################################################
__ASCII text                                      : ######################################################################################################
__ASCII text, with very long lines                : #################################
__HTML document, UTF-8 Unicode text, with         : ##############################
__HTML document, ASCII text, with very lon        : ##########################
__C source, UTF-8 Unicode text                    : #####################
__LaTeX document, UTF-8 Unicode text, with        : ####################
__SVG Scalable Vector Graphics image              : ###############
__LaTeX document, ASCII text, with very lo        : ############
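The `for ((...))` loop above is a bash extension, so it will not run under a strictly POSIX shell such as dash. A minimal sketch of the same idea for POSIX sh follows; the function name `bar_chart` and its width argument are hypothetical, chosen only for illustration. It pads an empty string to n columns with printf and turns the padding into # with tr:

```shell
# Hypothetical helper: turn "uniq -c" lines (count, then text) into a "#" bar chart.
# Uses only POSIX sh features, so it should also run under dash.
bar_chart() {
    width=${1:-40}    # label column width (assumed default of 40)
    while read -r n text; do
        # Print n spaces, then translate every space into "#".
        bar=$(printf "%${n}s" '' | tr ' ' '#')
        # Left-justify the label in a field of $width characters.
        printf "__%-${width}s: %s\n" "$text" "$bar"
    done
}

# Usage (assumed): ... | sort | uniq -c | sort -nr | head -10 | bar_chart 40
```

Because `read` splits on whitespace, the leading blanks that uniq -c puts in front of the count are dropped automatically.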

Answer 2

A perl approach; append:

| perl -lpE 's/\s*(\d+)\s(.*)/sprintf "__%-40s: %s", $2, "#"x$1/e'

Output:

__HTML document, ASCII text               : ########################################################################################################################################################################
__C source, ASCII text                    : ##################################################################################################################
__ASCII text                              : ######################################################################################################
__ASCII text, with very long lines        : #################################
__HTML document, UTF-8 Unicode text, with : ##############################
__HTML document, ASCII text, with very lon: ##########################
__C source, UTF-8 Unicode text            : #####################
__LaTeX document, UTF-8 Unicode text, with: ####################
__SVG Scalable Vector Graphics image      : ###############
__LaTeX document, ASCII text, with very lo: ############

Following @Ed's approach, using only perl:

find "$DIR" -type f | file -b "$SAVEFILES" |\
  perl -lnE '$s{substr$_,0,40}++;}{printf"__%-40s: %s\n",$_,"#"x$s{$_}for(splice@{[sort{$s{$b}<=>$s{$a}}keys%s]},0,10)'

Readable:

perl -lnE '
# count each 40-character type prefix
$seen{ substr $_,0,40 }++;
END {
   # print the 10 most frequent types, one "#" per occurrence
   printf "__%-40s: %s\n", $_, "#" x $seen{$_}
      for( splice @{[sort { $seen{$b} <=> $seen{$a} } keys %seen]},0,10 )
}'

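PS: note that the file utility only examines the files named in $SAVEFILES and ignores the names find writes to its standard input, so find ... | file -b $SAVEFILES is pointless.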
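One way to make find actually drive file is to let find pass the names as arguments with -exec ... {} +. This is only a sketch of that idea, not the asker's original intent for $SAVEFILES; the function name `list_types` is hypothetical:

```shell
# Hypothetical helper: print one `file -b` description for every regular file
# found under the directory given as the first argument.
list_types() {
    # "{} +" makes find append as many file names as fit to each file invocation.
    find "$1" -type f -exec file -b {} +
}

# Usage (assumed): list_types "$DIR" | cut -c1-40 | sort | uniq -c | sort -nr | head -10
```

Passing names through -exec also avoids problems with spaces in file names that a plain pipe into xargs would have.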

Answer 3

A shell loop is never the right way to process text; see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

You can do what you asked for with this awk command:

$ awk '{printf "%-40s: %s\n", substr($0,9), gensub(/ /,"#","g",sprintf("%*s",$1,""))}' file
HTML document, ASCII text               : ########################################################################################################################################################################
C source, ASCII text                    : ##################################################################################################################
ASCII text                              : ######################################################################################################
ASCII text, with very long lines        : #################################
HTML document, UTF-8 Unicode text, with : ##############################
HTML document, ASCII text, with very lon: ##########################
C source, UTF-8 Unicode text            : #####################
LaTeX document, UTF-8 Unicode text, with: ####################
SVG Scalable Vector Graphics image      : ###############
LaTeX document, ASCII text, with very lo: ############

But the right approach is to get rid of the cut and everything after it, and do something like this:

find "$DIR" -type f | file -b "$SAVEFILES" |
awk '
# count each 40-character type prefix
{ types[substr($0,1,40)]++ }
END {
    # visit the types in descending order of their counts (GNU awk only)
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (type in types) {
        printf "%-*s: %s\n", 40, type, gensub(/ /,"#","g",sprintf("%*s",types[type],""))
        if (++cnt == 10) {
            break
        }
    }
}
'

The above uses GNU awk for sorted_in and gensub(). The second script is untested, since you only provided sample input for the last part, printing the "#"s.
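Where GNU awk is not available, the same counting can be sketched with POSIX awk plus sort and head. This is an illustrative variant, not the answer's own script: the function name `type_histogram` is hypothetical, and it assumes the type descriptions contain no tab characters, because a tab is used as the field separator between count and text.

```shell
# Hypothetical helper: read one type description per line on stdin, count the
# 40-character prefixes, and print the top N (default 10) as "#" bars.
type_histogram() {
    top=${1:-10}
    tab=$(printf '\t')
    # 1) count every prefix and emit "count<TAB>prefix"
    awk '{ count[substr($0, 1, 40)]++ }
         END { for (t in count) printf "%d\t%s\n", count[t], t }' |
    # 2) highest count first; ties are ordered by the text
    sort -t "$tab" -k1,1nr -k2 |
    head -n "$top" |
    # 3) draw one "#" per occurrence
    awk -F '\t' '{
        n = $1 + 0
        bar = ""
        for (i = 0; i < n; i++) bar = bar "#"
        printf "%-40s: %s\n", $2, bar
    }'
}

# Usage (assumed): find "$DIR" -type f -exec file -b {} + | type_histogram 10
```

Delegating the ordering to sort keeps the awk parts free of gensub() and PROCINFO["sorted_in"], at the cost of two extra processes.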
