How to compare two files containing many long strings, then extract lines with at least n consecutive identical chars?

Question

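I have 2 large files, each containing long strings separated by newlines, in different formats. I need to find the similarities and differences between them. The problem is that the two files are formatted differently.

File a: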

9217:NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE:dasda97sda9sdadfghgg789hfg87ghf8fgh87

File b:

NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE

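So now I want to extract the entire line from file a that contains NjA5MDAxNdaeag0NjE5NTIx.XUwXRQ.gat8MzuGfkj2pWs7z8z-LBFXQaE into a new file, and also delete that line from file a.

I have tried to achieve this with meld, and got to the point where it at least shows me only the similarities. Let's say file a has 3000 lines and file b has 120 lines. Now I want to find the lines that share at least n consecutive identical characters and remove them from file a.

I found this and therefore tried using diff like this: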

  diff  --unchanged-line-format='%L' --old-line-format='' \
  --new-line-format='' a.txt b.txt

This did nothing. I got no output, so I guess it exited with 0 and found nothing.

How can I make this work? I have both Linux and Windows available.

Answer 1

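Given the format of the files, the most efficient approach would be the following:

1. Load all strings from b into a [hashtable] or [HashSet[string]]
2. Filter the contents of a by:
- extracting the substring from each line with String.Split(':') or similar
- checking whether it exists in the set from step 1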

$FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]@(
        Get-Content .\path\to\b
    )
)

Get-Content .\path\to\a |Where-Object {
    # Split the line into the prefix, middle, and suffix;
    # Discard the prefix and suffix
    $null,$searchString,$null = $_.Split(":", 3)

    if($FilterStrings.Contains($searchString)){
        # we found a match, write the whole line from `a` to the new file
        $_ |Add-Content .\path\to\matchedStrings.txt

        # make sure it isn't passed through
        $false
    }
    else {
        # substring wasn't found to be in `b`, let's pass it through
        $true
    }
} |Set-Content .\path\to\filteredStrings.txt
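
Since Linux is also available, the same two steps (load b into a set, then filter a on its middle field) could probably be sketched with awk. This is only a sketch under a few assumptions: the key is always the second `:`-separated field of each line in a, the key itself contains no `:`, and the function name `split_by_keys` and all file names are placeholders of mine, not anything from the question.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are illustrative):
#   split_by_keys A_FILE B_FILE MATCHED_OUT REMAINING_OUT
# Lines of A_FILE whose 2nd ':'-separated field appears as a whole line in
# B_FILE go to MATCHED_OUT; every other line goes to REMAINING_OUT.
# REMAINING_OUT must not be the same path as A_FILE.
split_by_keys() {
    a_file=$1
    b_file=$2
    matched=$3
    remaining=$4

    # Create/empty the match file so it exists even when nothing matches.
    : > "$matched"

    awk -F: -v matched="$matched" '
        # Drop a trailing carriage return, in case a file has Windows line endings.
        { sub(/\r$/, "") }

        # First file (b): remember every non-empty line as a key.
        FILENAME == ARGV[1] { if ($0 != "") keys[$0] = 1; next }

        # Second file (a): the middle field is a known key, so extract the line.
        ($2 in keys) { print > matched; next }

        # Anything else passes through to standard output.
        { print }
    ' "$b_file" "$a_file" > "$remaining"
}

# Example call (placeholder file names):
#   split_by_keys a.txt b.txt matchedStrings.txt filteredStrings.txt
```

As with the HashSet, the array lookup is done once per line of a, so the cost stays roughly linear in the size of both files. If the key is not always in the second field, this sketch will not find it and a substring search would be needed instead.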


linux

windows

powershell

diff

string-comparison