Remove multiple long lines from large file (TCL or shell)

Question

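I have a 2.5 GB ASCII file with about 3.7 million lines. Some of the lines are very long. The lines contain funny characters that commands might interpret as escape or special characters (slashes, backslashes, braces of all sorts, etc.).

I have a specific series of grep commands that extracts 16 lines from the file. I want to remove those 16 lines from the large file.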

grep pat1 bigfile | grep -v pat2 | grep -v pat3 | grep -v pat4 > temp

The lines in temp are about 10 MB long.

Now I want to invert that selection, so that the lines in temp are removed from bigfile.

I tried:

grep -v -f temp bigfile > newbigfile

which results in "grep: Memory exhausted".

I have a Unix shell and simple TCL scripting at my disposal.

Thanks, Gert

Answer 1

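While keeping a few tens of MB in memory is trivial for a Tcl program, you don't want to hold all 2.5 GB in memory at once if you can help it. That means keeping only the lines to exclude in memory and streaming the main data through: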

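# Load the exclusions into a list. -nonewline drops the file's final newline,
# so the list does not end with an empty element (which would also filter
# out every empty line of bigfile).
set f [open "temp"]
set linesToExclude [split [read -nonewline $f] "\n"]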
close $f

# Stream the main data through...
set fIn [open "bigfile"]
set fOut [open "newbigfile" "w"]
while {[gets $fIn line] >= 0} {
    # Only print the line if it isn't in our exclusions ('ni' means "not in")
    if {$line ni $linesToExclude} {
        puts $fOut $line
    }
}
close $fOut
close $fIn

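In general, I would rather not handle text lines more than a few hundred bytes long. Beyond that, it starts to feel like working with binary data, even if it is formally text...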
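
For comparison, here is a minimal shell sketch of the same exact-match idea. The `grep -v -f temp` attempt in the question treats every line of temp as a regular expression; adding `-F` (literal strings) and `-x` (whole-line match) makes grep do a plain comparison instead, much like `ni` above. The demo files and their contents are made up for illustration, and I have not checked whether a given grep copes with multi-megabyte patterns, so treat this as something to try rather than a guaranteed fix.

```shell
# Demo data standing in for the real files (contents are invented).
workdir=$(mktemp -d)
printf '%s\n' 'keep {a}' 'drop \x [1]' 'keep b' 'drop \x [1] extra' > "$workdir/bigfile"
printf '%s\n' 'drop \x [1]' > "$workdir/temp"

# -F: patterns are literal strings, not regular expressions
# -x: a pattern must match the entire line
# -v: print only the lines that do NOT match
# -f: read the patterns from a file, one per line
grep -F -x -v -f "$workdir/temp" "$workdir/bigfile" > "$workdir/newbigfile"
```

Because of `-x`, the line `drop \x [1] extra` survives: it merely contains an excluded line and is not identical to one.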

Answer 2

The name "temp" suggests that you don't really need that file. In that case you can do the whole thing in Tcl like this:

set fIn [open "bigfile"]
set fOut [open "newbigfile" "w"]
while {[gets $fIn line] >= 0} {
    # Skip the unwanted lines
    if {[regexp pat1 $line] && \
      ![regexp pat2 $line] && \
      ![regexp pat3 $line] && \
      ![regexp pat4 $line]} continue
    # Print lines that made it through
    puts $fOut $line
}
close $fOut
close $fIn

I don't know what this does to the time taken for the transformation, or whether that is a concern.
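
The same single-pass filter can be sketched in the shell with awk, assuming the four patterns are also valid awk (extended) regular expressions. `pat1` to `pat4` are the placeholders from the question, and the demo file is invented. Some awk implementations limit record length, so very long lines may need GNU awk or similar.

```shell
# Demo data standing in for the real bigfile (contents are invented).
workdir=$(mktemp -d)
printf '%s\n' 'alpha pat1' 'beta' 'pat1 pat2' 'pat1 pat4 x' 'gamma pat3' > "$workdir/bigfile"

# A line is dropped only if it matches pat1 and none of pat2, pat3, pat4;
# every other line is printed unchanged (awk's default action).
awk '!(/pat1/ && !/pat2/ && !/pat3/ && !/pat4/)' "$workdir/bigfile" > "$workdir/newbigfile"
```

Only `alpha pat1` is removed here, which is the line the original grep pipeline would have written to temp.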
