Is there an easy and fast solution to compare two csv files in bash?

Question

My problem: I have 2 large csv files with millions of lines.

One file contains a backup of the database from my server and looks like:

securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...

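Now I have another CSV file containing new codes with the exact same schema.

I would like to compare the two and find only the codes that are not already on the server. Because a friend of mine generates random codes, we want to be certain to update only the codes that are not yet on the server.

I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv

First I tried grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it took too many resources, so I thought there had to be a better way.

I then used diff to find only the new lines in newCodesSorted.csv, like diff serverBackupSorted.csv newCodesSorted.csv.

I believe you can tell diff directly that you only want the differences from the second file, but I didn't understand how, so I grepped the output, knowing that I could cut/remove the unwanted characters later: diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes

But I believe there must be a better way.

So I am asking whether you have any ideas on how to improve this method.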

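EDIT:

comm works great so far. But one thing I forgot to mention is that some of the codes on the server have already been scanned.

New codes, however, are always initialized with isScanned = false. So newCodes.csv would look like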

securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...

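I don't know whether it would be sufficient to use cut -d',' -f1 to reduce the files to just the codes and then use comm.

I tried that, once with grep and once with comm, and got different results. So I'm kind of unsure which one is the right way ^^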

Answer 1

Yes! A highly underrated tool, comm, is great for this. Examples stolen from here.

Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b

Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b

Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'

As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as part of the command (shown here with -12, which prints only the lines common to both files):
comm -12 <(sort a) <(sort b)
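
Regarding the edit in the question, where the isScanned flag can differ between the two files: one possible approach is to compare only the code column. The following is a minimal sketch, not a definitive solution. The sample rows are copied from the question, the temporary file names are hypothetical, and it assumes that no code contains a comma. It writes the sorted columns to temporary files instead of using process substitution, so it also runs in a plain POSIX shell.

```shell
# Hypothetical sample data mirroring the schema from the question.
workdir=$(mktemp -d)
printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' \
    'NALSKIFKEA,false' 'NAPOIDFNLE,true' > "$workdir/serverBackup.csv"
printf '%s\n' 'securityCode,isScanned' 'ALBSIBFOEA,false' 'OUVOENJBSD,false' \
    'NAPOIDFNLE,false' 'NALEJNSIDO,false' 'NPIAEBNSIE,false' > "$workdir/newCodes.csv"

# Use one byte-wise collation so that sort and comm agree on the ordering.
export LC_ALL=C

# Keep only the first column (the code) and sort it, dropping duplicates.
cut -d',' -f1 "$workdir/serverBackup.csv" | sort -u > "$workdir/serverCodes.txt"
cut -d',' -f1 "$workdir/newCodes.csv" | sort -u > "$workdir/newCodes.txt"

# -13 suppresses columns 1 and 3, leaving the codes found only in the second file.
new_only=$(comm -13 "$workdir/serverCodes.txt" "$workdir/newCodes.txt")
printf '%s\n' "$new_only"

rm -r "$workdir"
```

With the sample data this prints ALBSIBFOEA, NPIAEBNSIE and OUVOENJBSD, one per line. The header is present in both files, so it drops out on its own.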

If you prefer to stick with diff, you can get it to do what you want without grep:

diff --changed-group-format='%<%>' --unchanged-group-format='' 1.txt 2.txt

You could then alias that diff command to "comp" or something similar, allowing you to simply run:

comp 1.txt 2.txt

That might be handy if this is a command you are likely to use often.
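
As a rough sketch of what such a shortcut could look like, here is a shell function named comp. A function is swapped in for the alias because bash does not expand aliases in non-interactive scripts by default. It assumes GNU diff, since the group-format options are GNU extensions, and the two sample files are made up for illustration.

```shell
# comp: print only the lines that differ between two files (requires GNU diff).
comp() {
    diff --changed-group-format='%<%>' --unchanged-group-format='' "$1" "$2"
}

# Hypothetical sample files.
workdir=$(mktemp -d)
printf '%s\n' a b c > "$workdir/1.txt"
printf '%s\n' a c d > "$workdir/2.txt"

# diff exits with status 1 when the files differ, so guard against "set -e".
result=$(comp "$workdir/1.txt" "$workdir/2.txt") || true
printf '%s\n' "$result"

rm -r "$workdir"
```

For these two files the output is b (only in 1.txt) followed by d (only in 2.txt). The output does not say which file a line came from, which is worth keeping in mind when only the additions are wanted.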

Answer 2

I think sorting the files uses a lot of resources.
When you only want the new lines, you can try grep with the -v option:

grep -vFxf serverBackup.csv newCodes.csv

Or split serverBackup.csv first:

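# Split the backup into chunks of 10000 lines so each grep pattern list stays small.
split -a 4 --lines 10000 serverBackup.csv splitted
# Keep a copy of the original list; newCodes.csv is overwritten in the loop.
cp newCodes.csv newCodes.csv.org
for f in splitted*; do
   # Drop every line that also occurs in this chunk of the backup.
   grep -vFxf "${f}" newCodes.csv > smaller
   mv smaller newCodes.csv
done
rm splitted*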

Answer 3

Given:

$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true

$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true

You can use awk:

$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
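
The command above compares whole lines, so a code whose isScanned flag differs between the files is reported as new. If only the code itself should decide, a variant keyed on the first field may be closer to what the question's edit asks for. This is a sketch under the assumptions that the code is always the first comma-separated field and that f1 is not empty. The sample files repeat f1 and f2 from above.

```shell
# Recreate the sample files f1 and f2 from this answer in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' \
    'NALSKIFKEA,false' 'NAPOIDFNLE,true' > "$workdir/f1"
printf '%s\n' 'securityCode,isScanned' 'NALEJNSIDO,false' \
    'NALSKIFKEA,true' 'NAPOIDFNLE,false' 'SOMETHINGELSE,true' > "$workdir/f2"

# First pass (FNR==NR holds only while reading f1): remember every code.
# Second pass (f2): print the lines whose code was not seen in f1.
new_codes=$(awk -F',' 'FNR==NR { seen[$1]; next } !($1 in seen)' \
    "$workdir/f1" "$workdir/f2")
printf '%s\n' "$new_codes"

rm -r "$workdir"
```

Here only SOMETHINGELSE,true is printed, because the other three codes already exist in f1 regardless of their flag. No sorting is needed, but the codes from the first file are held in memory.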
