How to substitute alternate positions in a string using awk
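I have a lookup file whose entries I search for in file_1; wherever an entry occurs, I want to mask it with # and write the result to file_2. Currently my code replaces the entire record with #, but I only need to replace part of it.
I want to replace every alternate pair of characters in the string with #. How can I do this? Your help is much appreciated. Thanks.
Code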
awk ' NR==FNR {
          s = $0;
          gsub("[A-Za-z0-9]","#");      # masks every letter and digit of the lookup entry
          a[s] = $0;
          next
      }
      {
          if (match($0, ">[^<]+"))
          {
              str = substr($0, RSTART+1, RLENGTH-1)
              if (str in a )
              {
                  $0 = substr($0, 1, RSTART) a[str] substr($0, RSTART+RLENGTH)
              }
          }
          lines[FNR]=$0
      }
      END {for (i=1;i<=FNR;i++)
          {
              for (str in a )
              {
                  regex = "\<" str "\>"
                  gsub(regex,a[str],lines[i])
              }
              print lines[i]
          }
      }' lookup file_1 > file_2
cat lookup
CDX98XSD
@vanti Finserv Co.
11:11 - Capital
MS&CO(NY)
MS&CO(NY)
MS&CO(NY)
cat file_1
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="data">
<td>@vanti Finserv Co.</td>
<td>11:11 - Capital</td>
<td>MS&CO(NY)</td>
<td>New York</td>
<td>CDX98XSD</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr class="data">
<td>@vanti Finserv Co.</td>
<td></td>
<td>MS&CO(NY)</td>
<td>2</td>
<td>2</td>
<td>MS&CO(NY)</td>
<td>MS&CO(NY)</td>
<td></td>
</table>
</body>
</html>
Expected output
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="data">
<td>@##n## F##s##v C##</td>
<td>1##11 - C##i##l</td>
<td>M##C##N##</td>
<td>New York</td>
<td>C##9##S#</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr class="data">
<td>@##n## F##s##v C##</td>
<td></td>
<td>M##C##N##</td>
<td>2</td>
<td>2</td>
<td>M##C##N##</td>
<td>M##C##N##</td>
<td></td>
</table>
</body>
</html>
Assumptions/Understandings:
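- duplicate entries in lookup can be ignored (i.e., repeated entries are not treated any differently)
- for each whitespace-delimited string in lookup we want to replace the n-th and (n+1)-th characters with # (where n = 2, 5, 8, 11, 14, 17, 20, ...)
- for the lookup string 11:11 - Capital the correct replacement string is 1##1# - C##i##l (as opposed to OP's 1##11 - C##i##l)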
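To make the masking rule above concrete, here is a minimal sketch that applies it to one line of text. The names mask and mask_word are hypothetical helpers introduced only for this illustration. Unlike the full script, the sketch rejoins words with single spaces, so it assumes the input has no runs of whitespace that need preserving.

```shell
# mask: hypothetical wrapper; prints the masked form of the string in $1
mask() {
  printf '%s\n' "$1" | awk '
    # mask_word: hypothetical helper; keeps characters 1, 4, 7, ... of a word
    function mask_word(word,    out, len) {
      out = ""
      while (word != "") {
        len = length(word)
        # keep the 1st character; add one "#" each for the 2nd and 3rd, if present
        out = out substr(word, 1, 1) substr("##", 1, len - 1)
        word = substr(word, 4)          # drop the 3 characters just handled
      }
      return out
    }
    {
      # assumption: words are rejoined with a single space
      for (i = 1; i <= NF; i++)
        printf "%s%s", mask_word($i), (i < NF ? " " : "\n")
    }
  '
}

mask '11:11 - Capital'      # prints: 1##1# - C##i##l
mask '@vanti Finserv Co.'   # prints: @##n## F##s##v C##
```

The trailing 1# in 1##1# comes from the last chunk of 11:11 holding only two characters, so a single # is added rather than two.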
Added the following lines to the input files (based on OP's comments):
# file: lookup
READ 1234
READ
READ NOW
# file: file_1
<td>READ 1234 stuff READ</td>
<td>READ READ NOW</td>
<td>READ NOW READ 9999 York</td>
One awk idea:
awk '
FNR==NR { if ($0 in lookups)                     # if duplicate then ...
             next                                # ignore
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {                  # loop through list of white space delimited fields
              oldstr=$i
              newstr=""
              while (oldstr) {                   # while oldstr != ""
                    len=length(oldstr)
                    # keep 1st char; replace 2nd/3rd chars if length > 1/2, respectively
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)      # strip off first 3 characters
              }
              ndx=index(lookups[$0],$i)          # locate position of $i in current line
              # replace $i with newstr
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
        { match_found=1
          while (match_found) {
                match_found=0
                ndx=99999999
                len=0
                n=99999999
                # find earliest and longest match
                for (i in lookups) {
                    curr_len=length(lookups[i])
                    curr_ndx=index($0,i)
                    if (curr_ndx > 0) {
                       match_found=1
                       if (curr_ndx < ndx || (curr_ndx == ndx && curr_len > len)) {
                          ndx=curr_ndx
                          len=curr_len
                          n=i
                       }
                    }                            # if (curr_ndx > 0)
                }                                # for (i in lookup)
                if (match_found)
                   $0=substr($0,1,ndx-1) lookups[n] substr($0,ndx+len)
          }                                      # while ( match_found )
          print
        }
# uncomment following block to display contents of lookups[]
#END    { print "############ lookups[]"
#         for (i in lookups)
#             print i " => " lookups[i]
#         print "############"
#       }
' lookup file_1 > file_2
This generates:
$ cat file_2
<html>
<body>
<hr><br><>span class="table">Records</span><table>
<tr class="data">
<td>@##n## F##s##v C##</td>
<td>1##1# - C##i##l</td>
<td>M##C##N##</td>
<td>New York</td>
<td>C##9##S#</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr class="data">
<td>@##n## F##s##v C##</td>
<td></td>
<td>M##C##N##</td>
<td>2</td>
<td>2</td>
<td>M##C##N##</td>
<td>M##C##N##</td>
<td></td>
</table>
</body>
</html>
<td>R##D 1##4 stuff R##D</td>
<td>R##D R##D N##</td>
<td>R##D N## R##D 9999 York</td>
Focusing on just the differences:
$ diff file_1 file_2
5,7c5,7
< <td>@vanti Finserv Co.</td>
< <td>11:11 - Capital</td>
< <td>MS&CO(NY)</td>
---
> <td>@##n## F##s##v C##</td>
> <td>1##1# - C##i##l</td>
> <td>M##C##N##</td>
9c9
< <td>CDX98XSD</td>
---
> <td>C##9##S#</td>
16c16
< <td>@vanti Finserv Co.</td>
---
> <td>@##n## F##s##v C##</td>
18c18
< <td>MS&CO(NY)</td>
---
> <td>M##C##N##</td>
21,22c21,22
< <td>MS&CO(NY)</td>
< <td>MS&CO(NY)</td>
---
> <td>M##C##N##</td>
> <td>M##C##N##</td>
27,29c27,29
< <td>READ 1234 stuff READ</td>
< <td>READ READ NOW</td>
< <td>READ NOW READ 9999 York</td>
---
> <td>R##D 1##4 stuff R##D</td>
> <td>R##D R##D N##</td>
> <td>R##D N## R##D 9999 York</td>
Uncommenting the END{...} block generates:
############ lookups[]
READ NOW => R##D N##
READ => R##D
MS&CO(NY) => M##C##N##
READ 1234 => R##D 1##4
CDX98XSD => C##9##S#
@vanti Finserv Co. => @##n## F##s##v C##
11:11 - Capital => 1##1# - C##i##l
############
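The "find earliest and longest match" step is what lets READ NOW win over READ when both start at the same position. A minimal sketch of just that selection, isolated from the rest of the script, might look as follows. The function name pick_match and its argument layout (the line first, then candidate keys) are assumptions made for this illustration and are not part of the script above.

```shell
# pick_match: hypothetical helper; $1 is the text line, remaining args are lookup keys
pick_match() {
  line=$1
  shift
  # one key per input line; the text to search is passed through the environment
  printf '%s\n' "$@" | LINE=$line awk '
    BEGIN { line = ENVIRON["LINE"] }
    {
      pos = index(line, $0)              # 0 means this key does not occur
      if (pos > 0 && (best == "" || pos < best_pos ||
                      (pos == best_pos && length($0) > length(best)))) {
        best = $0                        # earliest start wins; ties go to the longer key
        best_pos = pos
      }
    }
    END { if (best != "") print best_pos ": " best }
  '
}

pick_match '<td>READ NOW READ 9999 York</td>' 'READ' 'READ NOW' 'READ 1234'
# prints: 5: READ NOW
```

Because READ NOW is picked first, the line is rewritten starting with R##D N##, and the remaining standalone READ is handled on a later pass of the while loop, which matches the last line of the output shown earlier.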