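Using gsub to substitute a variable with another variable which is a value from a function call
I have a function that replaces actual values with masked ones, based on patterns read from a file. What I am trying to achieve here is to call a function that uses gsub to find and replace strings, where the replacement value comes from another function call.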
$ cat pat-file
name 10101010
phone 10101010
code 10101010
bankaccount 1010101010101
$ cat data_sub.sh
abc()
{
awk '
function mask(str, str_masked) {
    # uses the global loop variable i to select the mask for the current tag
    for (j=1; j<=length(str); j++) {
        if (substr(masks[i], j, 1)==1) {
            c = substr(str, j, 1)
        } else {
            c = "*"
        }
        str_masked = str_masked c
    }
    return str_masked
}
FNR == NR {
    tags[NR-1] = $1
    masks[NR-1] = $2
}
FNR != NR {
    line = $0
    for (i in tags) {
        regex = "<"tags[i]">[^<]+</"tags[i]">"
        masked_line = ""
        l = length(tags[i])
        while (match(line, regex) > 0) {
            fulltag = substr(line, RSTART, RLENGTH)
            tagval = substr(fulltag, l+3, RLENGTH-l-l-5)
            fulltag_masked = "<"tags[i]">" mask(tagval) "</"tags[i]">"
            masked_line = masked_line substr(line, 1, RSTART-1) fulltag_masked
            line = substr(line, RSTART + RLENGTH)
        }
        line = masked_line line
    }
    print line
}' "$@" pat-file file-1 > output_file
}
abc
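The tagval variable stores the value of an XML tag. That value gets masked inside the XML, but since it also appears outside the XML, I need to mask those occurrences too. See the input file:
file-1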
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
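The logic is simple: store all the extracted tag values that were masked, then apply the same masking to those values wherever they appear outside the XML. How can I achieve this?
output_file (current)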
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
Expected output_file
This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
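Assumptions:
- if a string shows up under different tags (e.g., name=ABCD and code=ABCD), then the first mask awk finds will be used to mask the string (i.e., we won't prioritize the processing order of the tag/mask pairs)
- strings (to be masked) can show up anywhere in a line
- when matching non-tag substrings we'll use awk word boundaries (e.g., when masking ABCD we'll also mask ABCD-XYZ, but we won't mask ABCDABCD or ABCD_XYZ)
- both files, as well as an array of value/masked-value pairs, will fit in memory
- if OP has supplied a mask of 111111111... (all 1's), we're going to go ahead and perform the (effectively) no-op operation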
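To make the mask format and the all-1's case concrete, here is a minimal sketch of masking a single value. The helper name mask_value is hypothetical (it is not part of OP's script), and it assumes, as the sample pat-file and output suggest, that a 1 in the mask keeps the character and anything else (including positions past the end of the mask) becomes *.

```shell
mask_value() {
    # $1 = value to mask, $2 = mask made of 1s and 0s, as in pat-file
    awk -v val="$1" -v mask="$2" '
    BEGIN {
        out = ""
        for (j = 1; j <= length(val); j++)
            # keep the character only where the mask has a 1 at the same position
            out = out (substr(mask, j, 1) == "1" ? substr(val, j, 1) : "*")
        print out
    }'
}

mask_value ABCD 10101010      # prints A*C*
mask_value MEH 111            # all-1s mask: prints MEH unchanged (a no-op)
```

Because the mask is consulted position by position, a value longer than its mask has its tail fully masked.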
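General operation:
- process the input file (e.g., file-1) looking for 'tag' entries
- if we find any matching 'tag' entries, we'll apply the proposed mask to the associated value
- for each masked value we'll keep a copy of the value and its masked form in a new array
- for duplicate values we'll apply the saved mask
- all lines, with or without tags/masked data, are saved in an array
- END processing loops through our array of lines again, looking for any (word-boundaried) strings that were previously masked and, if found, replaces them with the saved masked value
- in the case of a mask of 11111111... (all 1's), this END processing will also re-mask the 'tag' entries (still, effectively, a no-op)
- all lines are then sent to stdout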
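The replacement step works because the second argument of gsub can be any expression, including the return value of a user-defined function. A minimal sketch of just that piece follows; the wrapper name mask_words and the two-argument mask() are hypothetical, and no word boundaries are applied here.

```shell
mask_words() {
    # $1 = word to mask (used as a regex), $2 = mask of 1s and 0s; text comes from stdin
    awk -v word="$1" -v pattern="$2" '
    function mask(str, pat,    j, out) {        # j and out are local variables
        for (j = 1; j <= length(str); j++)
            out = out (substr(pat, j, 1) == "1" ? substr(str, j, 1) : "*")
        return out
    }
    # the replacement text is whatever the function call returns
    { gsub(word, mask(word, pattern)); print }
    '
}

printf '%s\n' 'data = ABCD and ABCD' | mask_words ABCD 1010   # prints: data = A*C* and A*C*
```

Note that an & in the replacement text would be expanded by gsub to the matched text; the masked values in this example contain only letters, digits and *, so that does not come up.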
Adding some lines to the sample input file:
$ cat file-1
This is a demo data = ABCD
This is a demo data = XYCD
This is a demo data = ABCD
This is a demo data = BLAH
This is a demo data = ABCD
This is a demo data = MEH
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD
This is a demo data = ABCD and MEH
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
This is a demo data <tag changed="yes"<name>ABCD</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = ABCD and XYCD
This is a demo data = XYCD and MEH
This is ABCD and MEH demo data <tag changed="yes"<name>Winkelstein</name><phone>98762123</phone><code>MEH</code><bankaccount>4563728495847</bankaccount></tag>
One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ
One idea based on OP's current awk code:
awk '
function mask(str, str_masked) {
    # relies on the global loop variable "tag" to select the mask
    for (j=1; j<=length(str); j++) {
        if (substr(masks[tag], j, 1) == 1)
            c = substr(str, j, 1)
        else
            c = "*"
        str_masked = str_masked c
    }
    return str_masked
}
FNR == NR { masks[$1] = $2; next }          # 1st file: tag => mask
{ line = $0
  for (tag in masks) {
      regex = "<" tag ">[^<]+</" tag ">"
      masked_line = ""
      len = length(tag)
      while (match(line, regex) > 0) {
          val = substr(line, RSTART+(len+2), RLENGTH-(len+2)-(len+3))
          masked[val]= (val in masked) ? masked[val] : mask(val)    # reuse a saved mask
          masked_line = masked_line substr(line, 1, RSTART-1) "<" tag ">" masked[val] "</" tag ">"
          line = substr(line, RSTART + RLENGTH)
      }
      line = masked_line line
  }
  lines[FNR]=line
}
END { for (i=1;i<=FNR;i++) {
          for (val in masked) {
              regex="\\<" val "\\>"         # GNU awk word boundaries; backslashes doubled inside a string
              gsub(regex,masked[val],lines[i])
          }
          print lines[i]
      }
    }
' pat-file file-1
This generates:
This is a demo data = A*C*
This is a demo data = XYCD
This is a demo data = A*C*
This is a demo data = BLAH
This is a demo data = A*C*
This is a demo data = M*H
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C*
This is a demo data = A*C* and M*H
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
This is a demo data <tag changed="yes"<name>A*C*</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
#####################
# some more lines ...
#####################
This is a demo data = A*C* and XYCD
This is a demo data = XYCD and M*H
This is A*C* and M*H demo data <tag changed="yes"<name>W*n*e*s****</name><phone>9*7*2*2*</phone><code>M*H</code><bankaccount>4*6*7*8*9*8*7</bankaccount></tag>
One last line A*C* A*C*-XYZ ABCDABCD ABCD_XYZ
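The \< and \> word-boundary operators are GNU awk extensions, so the END block above may not behave the same in other awk implementations. As a hedged alternative, the sketch below emulates the same boundary rule (letters, digits and underscore count as word characters) using only index() and substr(). The helper name replace_word is hypothetical, and the search word is treated as a literal string rather than a regex.

```shell
replace_word() {
    # $1 = literal word to find, $2 = replacement text; lines are read from stdin
    awk -v word="$1" -v repl="$2" '
    function is_word_char(ch) { return ch ~ /[A-Za-z0-9_]/ }   # an empty string returns false
    {
        line = $0; out = ""; start = 1
        n = length(word); len = length(line)
        if (n == 0) { print; next }                    # nothing to search for
        while (start <= len && (off = index(substr(line, start), word)) > 0) {
            pos    = start + off - 1                   # absolute position of this hit
            before = (pos > 1) ? substr(line, pos - 1, 1) : ""
            after  = substr(line, pos + n, 1)
            if (is_word_char(before) || is_word_char(after)) {
                # hit is embedded in a longer word: copy up to and including its first char, keep scanning
                out   = out substr(line, start, pos - start + 1)
                start = pos + 1
            } else {
                # stand-alone word: copy the text before it, then the replacement
                out   = out substr(line, start, pos - start) repl
                start = pos + n
            }
        }
        print out substr(line, start)
    }'
}

printf '%s\n' 'One last line ABCD ABCD-XYZ ABCDABCD ABCD_XYZ' | replace_word ABCD 'A*C*'
```

Using index() instead of a dynamic regex also means values containing regex metacharacters (such as . or +) are matched literally, which the gsub-based version does not guarantee.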