Using gsub to substitute a variable with another variable which is a value from a function call

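I have a function that masks actual values in a file according to patterns read from another file. What I am trying to achieve is to call a function that uses gsub to find and replace strings, where the replacement value comes from another function call.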

$ cat pat-file
name         10101010
phone        10101010
code         10101010
bankaccount  1010101010101

$ cat data_sub.sh

abc()
{
awk '
function mask(str, str_masked) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }

        str_masked = str_masked c
    }

    return str_masked
}

FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}

FNR != NR {
    line = $0

    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])
        while (match(line, regex) > 0) {
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked

            line = substr(line, RSTART + RLENGTH)
        }

        line = masked_line line
    }

    print line
}' "$@" pat-file file-1 > output_file
}

abc

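The tagval variable holds the value of an XML tag, and that value gets masked inside the XML. Since the same value also appears outside the XML, I need to mask those occurrences as well. See the input file:

file-1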

This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>

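The logic is straightforward: store every extracted tag value that was masked, then apply the same masking to those values wherever they appear outside the XML. How can I do this?

Current output file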

This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>

Expected output file

This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>

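Assumptions:

  • if a string shows up under different tags (e.g., name=ABCD and code=ABCD), the first mask awk finds is the one used to mask the string (i.e., we do not prioritize the order in which tag/mask pairs are processed)
  • a string (to be masked) can show up anywhere in a line
  • when matching non-tag substrings we will use awk word boundaries (e.g., when masking ABCD we will also mask ABCD-XYZ, but we will not mask ABCDABCD or ABCD_XYZ)
  • both files, plus the array of value/masked-value pairs, fit in memory
  • if OP provides a mask of 111111111... (all 1's), we will go ahead and perform the (effectively) no-op operation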
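
The word-boundary assumption can also be met without regex word-boundary operators, which are a GNU awk extension. The following is only a sketch of one portable alternative, not the approach used in the answer's code: the function name replace_word and its parameters are made up for illustration. It searches for the value as a literal string with index() and checks the neighbouring characters by hand, which also sidesteps regex metacharacters in the value and a stray & in the replacement.

```shell
replace_words_demo() {
    awk '
    # replace_word is a hypothetical helper: replace each literal occurrence
    # of val in s with repl, but only when val is not directly preceded or
    # followed by a word character (letter, digit or underscore).
    function replace_word(s, val, repl,    out, pos, prev, before, after) {
        out = ""; prev = ""
        while ((pos = index(s, val)) > 0) {
            # prev remembers the original character just before the rest of s
            before = (pos > 1) ? substr(s, pos - 1, 1) : prev
            after  = substr(s, pos + length(val), 1)
            if (before ~ /[A-Za-z0-9_]/ || after ~ /[A-Za-z0-9_]/) {
                # part of a longer word: step one character past this match
                out  = out substr(s, 1, pos)
                prev = substr(s, pos, 1)
                s    = substr(s, pos + 1)
            } else {
                out  = out substr(s, 1, pos - 1) repl
                prev = substr(s, pos + length(val) - 1, 1)
                s    = substr(s, pos + length(val))
            }
        }
        return out s
    }
    BEGIN {
        print replace_word("One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ", "ABCD", "A*C*")
    }'
}
replace_words_demo
```

On the sample line this masks ABCD and ABCD-XYZ and leaves ABCDABCD and ABCD_XYZ alone, matching the assumption stated above.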

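General operation:

  • process the input file (e.g., file-1) looking for 'tag' entries
  • if we find any matching 'tag' entries, we apply the designated mask to the associated value
  • for each masked value we keep a copy of the value and its masked form in a new array
  • for repeat values we apply the saved masked form
  • all lines, with or without tags/masked data, are saved in an array
  • END processing loops through our array of lines again, looking for any (word-boundaried) strings that were previously masked and, if found, replaces them with the saved masked value
  • in the case of a mask of 11111111... (all 1's), this END processing will also re-mask the 'tag' entries (still, effectively, a no-op)
  • all lines are then sent to stdout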
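
The masking and "reuse the saved mask" steps above can be tried in isolation. This is a minimal sketch under assumptions: unlike the code in this thread, the hypothetical mask() here takes the mask pattern as an explicit argument, and remember() is an invented helper that stands in for the value/masked-value array. The second call shows that a value already seen keeps its first masked form even when a different pattern is offered.

```shell
mask_demo() {
    awk '
    # Keep character j of str when character j of pattern is "1",
    # otherwise emit "*". Characters beyond the end of pattern become "*".
    function mask(str, pattern,    out, j) {
        out = ""
        for (j = 1; j <= length(str); j++)
            out = out (substr(pattern, j, 1) == "1" ? substr(str, j, 1) : "*")
        return out
    }
    # Hypothetical helper: mask a value once and reuse the saved result.
    function remember(val, pattern) {
        if (!(val in masked))
            masked[val] = mask(val, pattern)
        return masked[val]
    }
    BEGIN {
        print remember("ABCD", "10101010")
        print remember("ABCD", "01010101")        # already saved, first mask wins
        print remember("Winkelstein", "10101010")
    }'
}
mask_demo
```

Because the sample patterns are 8 characters long, everything past the eighth character of a longer value is replaced with "*", which is why Winkelstein ends in four asterisks.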

Adding some lines to the sample input file:

$ cat file-1
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = ABCD and XYCD
This is a demo data = XYCD and MEH
This is ABCD and MEH demo data <tag changed="yes"<name>Winkelstein</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ

One idea based on OP's current awk code:

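awk '
# Build the masked form of str from the mask pattern of the current tag:
# a 1 in the pattern keeps the character, anything else becomes "*".
# tag is the loop variable set by the caller; str_masked, j and c are locals.
function mask(str, str_masked,    j, c) {
    for (j=1; j<=length(str); j++) {
        if (substr(masks[tag], j, 1) == 1)
           c = substr(str, j, 1)
        else
           c = "*"
        str_masked = str_masked c
    }
    return str_masked
}

# 1st file (pat-file): map tag name -> mask pattern
FNR == NR { masks[$1] = $2; next }

# 2nd file (file-1): mask tag values and save each value/masked-value pair
          { line = $0

            for (tag in masks) {
                regex = "<" tag ">[^<]+</" tag ">"
                masked_line = ""
                len = length(tag)

                while (match(line, regex) > 0) {
                      # skip "<tag>" (len+2 chars) and drop "</tag>" (len+3 chars)
                      val = substr(line, RSTART+(len+2), RLENGTH-(len+2)-(len+3))
                      masked[val]= (val in masked) ? masked[val] : mask(val)
                      masked_line = masked_line substr(line, 1, RSTART-1) "<" tag ">" masked[val] "</" tag ">"
                      line = substr(line, RSTART + RLENGTH)
                }
                line = masked_line line
            }
            lines[FNR]=line
        }

# Re-scan the saved lines and mask saved values that appear as whole words.
# \< and \> are GNU awk word boundaries; the backslashes are doubled
# because the regex is built from a string.
END     { for (i=1;i<=FNR;i++) {
              for (val in masked) {
                  regex="\\<" val "\\>"
                  gsub(regex,masked[val],lines[i])
              }
              print lines[i]
          }
        }
' pat-file file-1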

This generates:

This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = A*C* and XYCD
This is a demo data = XYCD and M*H
This is A*C* and M*H demo data <tag changed="yes"<name>W*n*e*s****</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
One last line A*C* A*C*-XYZ ABCDABCD ABCD_XYZ
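
The offset arithmetic used to pull a tag's value out of the match() result can be checked on a single tag. A small sketch, assuming a made-up sample line and the tag name: the opening tag is len+2 characters long and the closing tag len+3, so the value starts at RSTART+len+2 and its length is RLENGTH minus both tags.

```shell
extract_demo() {
    awk '
    BEGIN {
        line  = "data <name>ABCD</name><code>MEH</code>"   # made-up sample line
        tag   = "name"
        len   = length(tag)
        regex = "<" tag ">[^<]+</" tag ">"
        if (match(line, regex)) {
            # "<name>" is len+2 characters, "</name>" is len+3 characters
            val = substr(line, RSTART + len + 2, RLENGTH - (len + 2) - (len + 3))
            print RSTART, RLENGTH, val
        }
    }'
}
extract_demo
```

For this line the match starts at character 6 and spans 17 characters, so the extracted value is the 4 characters ABCD.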