Bash: Find and replace lines in a file using the lines of another file

Question

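I have two files: masterlist.txt, which has hundreds of lines of URLs, and toupdate.txt, which has a handful of updated versions of lines from masterlist.txt that need to be replaced.

I'd like to automate this process with Bash, since these lists are already created and used in a bash script.

The server part of the URL is the part that changes, so we can match on the unique part: /whatever/whatever_user.xml. But how do I find and replace those lines in masterlist.txt? That is, how do I go through each line of toupdate.txt and, for a line ending in /f_SomeName/f_SomeName_user.xml, find the line in masterlist.txt with that same ending and replace the whole line with the new one?

So https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml becomes https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml, for example.

The rest of masterlist.txt needs to stay intact, so we must only find and replace lines that have a different server for the same line ending (ID).

Structure

masterlist.txt looks like this: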

https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://101112url.domain.com/1/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

toupdate.txt looks like this:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml

Desired outcome

Make masterlist.txt look like:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml
[...]

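Initial examination

I've looked at sed, but I don't know how to do a find and replace using lines from two files.

Here is what I have so far, which at least does the file handling: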

#!/bin/bash

#...

while read -r line; do
    # there's a new link on each line
    link="${line}"
    # extract the unique part from the end of each line
    grabXML="${link##*/}"
    grabID="${grabXML%_user.xml}"
    # if we cannot grab the ID, then just set it to use the full link so we don't have an empty string
    if [ -n "${grabID}" ]; then
        identifier=${grabID}
    else
        identifier="${line}"
    fi
    
    ## the find and replace here? ##    

# we're done when we've reached the end of the file
done < "masterlist.txt"

Answer 1

Would you please try the following:

#!/bin/bash

declare -A map
while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        map[$uniq_part]=$line
    fi
done < "toupdate.txt"

while IFS= read -r line; do
    if [[ $line =~ (/[^/]+/[^/]*\.xml)$ ]]; then
        uniq_part="${BASH_REMATCH[1]}"
        if [[ -n ${map[$uniq_part]} ]]; then
            line=${map[$uniq_part]}
        fi
    fi
    echo "$line"
done < "masterlist.txt" > "masterlist_tmp.txt"

# if the result of "masterlist_tmp.txt" is good enough, uncomment the line below
# mv -f -- "masterlist_tmp.txt" "masterlist.txt"

Result:

https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
https://456789url.domain.com/32/path/f_AnotherName/f_AnotherName_user.xml
https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml
https://222blah11.domain.com/19/path/e_BlahBlah/e_BlahBlah_user.xml

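[Explanation]

The associative array map maps the "unique part", such as /f_SomeName/f_SomeName_user.xml, to the "full path", such as https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml.
The regex (/[^/]+/[^/]*\.xml)$, if it matches, assigns to the shell variable BASH_REMATCH[1] the substring that starts at the second-rightmost slash and extends to the ".xml" at the end of the string.
In the first loop, over the file "toupdate.txt", it generates "unique part" and "full path" pairs as the key/value pairs of the associative array.
In the second loop, over the file "masterlist.txt", the extracted "unique part" is tested to see whether an associated value exists. If so, the line is replaced with the associated value, i.e. the line from the file "toupdate.txt".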
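To see what the regex captures, here is a minimal sketch that runs the same match on a single line. The sample URL is taken from the question's toupdate.txt; storing the pattern in a variable named re is my own choice for the illustration, not part of the answer's script.

```shell
#!/bin/bash
# Sample line copied from the question's toupdate.txt
line='https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml'

# Same pattern as in the script above, kept in a variable (hypothetical name)
re='(/[^/]+/[^/]*\.xml)$'

uniq_part=''
if [[ $line =~ $re ]]; then
    # BASH_REMATCH[1] holds the text matched by the first parenthesized group
    uniq_part="${BASH_REMATCH[1]}"
fi

printf '%s\n' "$uniq_part"
# prints: /f_SomeName/f_SomeName_user.xml
```

The captured string is what serves as the key of the associative array, so two URLs with different servers but the same ending produce the same key.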

[Alternative]
If the text files are large, bash may not be fast enough. In that case, an awk script will work more efficiently:

awk 'NR==FNR {
    # first file (toupdate.txt): remember each full line under its unique part
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        map[substr($0, RSTART, RLENGTH)] = $0
    }
    next
}
{
    # second file (masterlist.txt): swap in the updated line if one exists
    if (match($0, "/[^/]+/[^/]*\\.xml$")) {
        full_path = map[substr($0, RSTART, RLENGTH)]
        if (full_path != "") {
            $0 = full_path
        }
    }
    print
}' "toupdate.txt" "masterlist.txt" > "masterlist_tmp.txt"

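[Explanation]

The NR==FNR { BLOCK1; next } { BLOCK2 } syntax is a common idiom for switching the processing per file. The NR==FNR condition is met only for the first file in the argument list, and the next statement skips the following block, so BLOCK1 processes the file "toupdate.txt" only. Similarly, BLOCK2 processes the file "masterlist.txt" only.
If the function match($0, pattern) succeeds, it sets the awk variable RSTART to the start position of the matched substring within $0 (the current record read from the file) and sets the variable RLENGTH to the length of the matched substring. Now we can extract the matched substring, such as /f_SomeName/f_SomeName_user.xml, by using the substr() function.
Then we assign to the array map so that the substring (the unique part) maps to the whole URL in "toupdate.txt".
The second block works mostly like the first one. If a value corresponding to the key is found in the array map, the record ($0) is replaced with the value of the array indexed by that key.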
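The match()/substr() step can be tried in isolation. The following is a small sketch, assuming a sample URL from the question's toupdate.txt; the shell variable names url and uniq_part are made up for the illustration.

```shell
#!/bin/bash
# Sample line copied from the question's toupdate.txt
url='https://foo-254.domain.com/8/path/g_SomethingElse/g_SomethingElse_user.xml'

# match() sets RSTART and RLENGTH; substr() then cuts out the matched text
uniq_part=$(printf '%s\n' "$url" |
    awk 'match($0, "/[^/]+/[^/]*\\.xml$") { print substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$uniq_part"
# prints: /g_SomethingElse/g_SomethingElse_user.xml
```

The pattern is passed as a string here, so the backslash before the dot is doubled to reach the regex engine as an escaped dot.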

Answer 2

Why not have sed write its own script, which produces the desired output:

sed -e "$(sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2$| s|.*|\1\2|<' toupdate.txt)" masterlist.txt

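where

the inner sed command contains an outer and an inner s substitution command
the outer s (s<...<...<) captures scheme://domain/N/ as \1 and the rest of the path (.*) as \2, and inserts them into the script for the outer sed command
the outer sed script (\|\2$| s|.*|\1\2|) finds the line in masterlist.txt whose URL ends with that remaining path and replaces the whole line with the one from toupdate.txt

To avoid a lot of backslash escaping, < and | are used as the delimiters of the two s commands, and \|...| is used in place of /.../ for the address.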
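It can help to look at the script the inner sed generates before handing it to the outer sed. The sketch below assumes the inner command uses the backreferences \1 and \2 as the explanation describes, and feeds it one sample line from the question instead of the whole toupdate.txt; the variable names generated and result are made up for the illustration.

```shell
#!/bin/bash
# Step 1: run only the inner sed on one line from the question's toupdate.txt
generated=$(printf '%s\n' 'https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml' |
    sed -e 's<^\(http[s]*://[^/]*/[^/]*/\)\(.*\)<\\|\2$| s|.*|\1\2|<')

printf '%s\n' "$generated"
# prints: \|path/f_SomeName/f_SomeName_user.xml$| s|.*|https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml|

# Step 2: use the generated text as the script for a second sed,
# applied to the matching line from the question's masterlist.txt
result=$(printf '%s\n' 'https://123456url.domain.com/26/path/f_SomeName/f_SomeName_user.xml' |
    sed -e "$generated")

printf '%s\n' "$result"
# prints: https://new-123.domain.com/1/path/f_SomeName/f_SomeName_user.xml
```

Like the one-liner above, this writes to standard output only; masterlist.txt itself is not modified unless the output is redirected to a temporary file and moved into place.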
