Separate lines by key and store them in different files

Question

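How can I extract from a text file the whole lines that carry a hexadecimal key and contain DEBUG, and store them in different files (one file per key), where the key has the format "[ uid key]"? In other words, any line that is not a DEBUG line should be ignored.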

in.txt:

  [ uid 28fd4583833] DEBUG web.Action
  [ uid 39fd5697944] DEBUG test.Action
  [ uid 56866969445] DEBUG test2.Action
  [ uid 76696944556] INFO  test4.Action
  [ uid 39fd5697944] DEBUG test7.Action
  [ uid 85483e10256] DEBUG testing.Action

The output files are named "out" + i + ".txt", where i = 1, 2, 3, 4. That is:

out1.txt:

  [ uid 28fd4583833] DEBUG web.Action

out2.txt:

  [ uid 39fd5697944] DEBUG test.Action
  [ uid 39fd5697944] DEBUG test7.Action

out3.txt:

  [ uid 56866969445] DEBUG test2.Action

out4.txt:

  [ uid 85483e10256] DEBUG testing.Action

I tried:

awk 'match($0, /uid ([^]]+)/, a) && /DEBUG/ {print > (a[1] ".txt")}' in.txt

Answer 1

First solution: With GNU awk, try the following single awk program. It relies on GNU awk's PROCINFO["sorted_in"] feature.

awk '
BEGIN{
  PROCINFO["sorted_in"] = "@ind_num_asc"
}
!/DEBUG/{ next }
match($0,/uid [a-zA-Z0-9]+/){
  ind=substr($0,RSTART,RLENGTH)
  arr[ind]=(arr[ind]?arr[ind] ORS:"") $0
}
END{
  for(i in arr){
    outputFile=("out"++count".txt")
    print arr[i] > (outputFile)
    close(outputFile)
  }
}
'  Input_file

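Second solution: With any awk, try the following with your shown samples. Change Input_file to your actual file name. This uses GNU sort with the -s option so that lines with equal keys keep their input order while sorting.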

awk '
!/DEBUG/{ next }
match($0,/uid [0-9a-zA-Z]+/){
  print substr($0,RSTART,RLENGTH)";"$0
}' Input_file  | 
sort -s -k2,2  | 
cut -d';' -f2- | 
awk '
match($0,/uid [0-9a-zA-Z]+/){
  if(prev!=substr($0,RSTART,RLENGTH)){
    count++
    close(outputFile)
  }
  outputFile="out"count".txt"
  print > (outputFile)
  prev=substr($0,RSTART,RLENGTH)
}
'

Explanation of the first solution: a detailed, line-by-line explanation of the code above:

awk '                                       ##Starting awk program from here.
BEGIN{                                      ##Starting BEGIN section from here.
  PROCINFO["sorted_in"] = "@ind_num_asc"    ##Setting PROCINFO["sorted_in"] to @ind_num_asc to sort any array with index.
}
!/DEBUG/{ next }                            ##If a line does not contain DEBUG then jump to next line.
match($0,/uid [a-zA-Z0-9]+/){               ##using match function to match uid space and alphanumeric values here.
  ind=substr($0,RSTART,RLENGTH)             ##Creating ind which contains sub string of matched sub string in match function.
  arr[ind]=(arr[ind]?arr[ind] ORS:"") $0    ##Creating array arr with index of ind and keep adding current line value to same index.
}
END{                                        ##Starting END block of this program from here.
  for(i in arr){                            ##Traversing through array arr here.
    outputFile=("out"++count".txt")         ##Creating output file name here as per OP requirement.
    print arr[i] > (outputFile)             ##printing current array element into outputFile variable.
    close(outputFile)                       ##Closing output file in backend to avoid too many files opened error.
  }
}
'  Input_file                               ##Mentioning Input_file name here.

Answer 2

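If you are willing to change the output file names to include the key (which, frankly, seems more useful than a one-up counter in the name), you can do:

awk '/DEBUG/{print > ("out-" $3 ".txt")}' FS='[][ ]*'  in.txt

This puts all lines that match the string DEBUG and have the key 85483e10256 into the file out-85483e10256.txt, and so on for the other keys.

If you do want to keep the one-up counter, you can do:

 awk '/DEBUG/{if( ! a[$3] ) a[$3] = ++counter;
     print > ("out" a[$3] ".txt")}' FS='[][ ]*'  in.txt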

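Basically, the idea is to use the regular expression [][ ]* as the field separator, which matches a run of square brackets or spaces. That way, $1 is the text before the initial [, $2 is the string uid, and $3 is the key. This will (should!) correctly get the key for lines that may have slightly different whitespace. We use an associative array to record which keys have already been seen, and so to track the counter. But using the key in the output file name really is cleaner.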
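To see how this field separator splits a line, here is a small sketch that prints every field of one sample line from the question. The helper name show_fields and the `$N=<...>` display format are made up for illustration; the key should show up as field 3.

```shell
# Hypothetical helper: show how FS='[][ ]*' splits one sample line.
show_fields() {
  printf '%s\n' '  [ uid 28fd4583833] DEBUG web.Action' |
    awk -F'[][ ]*' '{
      # Print each field with its number so empty fields stay visible.
      for (i = 1; i <= NF; i++) printf "$%d=<%s>\n", i, $i
    }'
}
show_fields
```

Because the line starts with a separator, the first field is empty, which is why the key lands in $3 rather than $2.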

Answer 3

If your file format is as consistent as what you have shown, you can just do:

awk '
    $4!="DEBUG" { next }
    !f[$3] { f[$3]=++i }
    { print > ("out" f[$3] ".txt") }
' in.txt

Answer 4

Using GNU sort for -s (to guarantee that the input line order is retained for each key value) and any awk:

$ sort -sk3,3 in.txt |
    awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'

$ head out*.txt
==> out1.txt <==
  [ uid 28fd4583833] DEBUG web.Action

==> out2.txt <==
  [ uid 39fd5697944] DEBUG test.Action
  [ uid 39fd5697944] DEBUG test7.Action

==> out3.txt <==
  [ uid 56866969445] DEBUG test2.Action

==> out4.txt <==
  [ uid 85483e10256] DEBUG testing.Action

If you don't have GNU sort, you can apply the DSU (Decorate/Sort/Undecorate) idiom with any sort:

$ awk -v OFS='\t' '{print NR, $0}' in.txt | sort -k4,4 -k1,1n | cut -f2- |
    awk '$4!="DEBUG"{next} $3!=prev{close(out); out="out"(++i)".txt"; prev=$3} {print > out}'

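Note that with the above, only sort has to handle all of the input at once, and it is designed to use demand paging and the like to cope with massive input. awk processes just 1 line at a time, keeps almost nothing in memory, and has only 1 output file open at a time. For large files, the above is therefore far more likely to succeed than an approach that stores a lot in memory in awk or keeps many output files open concurrently.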
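If sorting is not wanted, a middle ground is possible. The following is a minimal sketch, not taken from any of the answers as written: it numbers files by first appearance of each key (as in Answer 3) but appends and closes the file after every line, so only one output file is open at a time. It assumes the consistent whitespace-separated layout shown in the question, and the scratch directory and the variable names num, n and out are illustrative.

```shell
# Assumption: work in a scratch directory with the sample input from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > in.txt <<'EOF'
  [ uid 28fd4583833] DEBUG web.Action
  [ uid 39fd5697944] DEBUG test.Action
  [ uid 56866969445] DEBUG test2.Action
  [ uid 76696944556] INFO  test4.Action
  [ uid 39fd5697944] DEBUG test7.Action
  [ uid 85483e10256] DEBUG testing.Action
EOF

awk '
$4 != "DEBUG" { next }               # skip non-DEBUG lines
!($3 in num)  { num[$3] = ++n }      # first time this key is seen: assign next number
{
  out = "out" num[$3] ".txt"
  print >> (out)                     # append, since the file is reopened per line
  close(out)                         # keep at most one output file open
}
' in.txt
```

The trade-offs: awk still keeps one small array entry per distinct key, reopening a file for every line is slower than the sort-based pipeline, and because of `>>` any out*.txt files left over from an earlier run must be removed first.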

Answer 5

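A relatively portable awk-based solution with these highlights:

Output lines keep their leading double space (it is not truncated)
Output file numbering follows stable input line order, without pre-sorting lines, post-sorting lines, or relying on GNU gawk-specific features
Tested and confirmed working on
- gawk 5.1.1, including with the -ce flags,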
- mawk 1.3.4,
- mawk 1.9.9.6, and
- macOS nawk 20200816

————————————————————————————————

    # gawk profile, created Thu May 19 12:10:56 2022

    BEGIN {
        ____ =      "test_72297811_"        # opt. filename prefix
         OFS = FS = "^  [[] uid "
        _+=_ = gsub("\^|[[][]]", _, OFS)
        _*=  _-- 
    } NF *= / DEBUG / {
        print >> (__[___ = substr($NF,_~_,_)] ?__[___]:\
                  __[___]= ____ "out" length(__) ".txt" )
    } END { 
           for (_ in __) { close(__[_]) } }'

————————————————————————————————

==> test_72297811_out1.txt <==
  [ uid 28fd4583833] DEBUG web.Action

==> test_72297811_out2.txt <==
  [ uid 39fd5697944] DEBUG test.Action
  [ uid 39fd5697944] DEBUG test7.Action

==> test_72297811_out3.txt <==
  [ uid 56866969445] DEBUG test2.Action

==> test_72297811_out4.txt <==
  [ uid 85483e10256] DEBUG testing.Action
