Delete n duplicate lines in a file

Question

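1. Briefly

I have a large text file (14 MB). I need to delete the text blocks in the file that contain 5 duplicate lines.

It would be nice if this could be done with any free method.

I use Windows, but a Cygwin solution would also be fine.

2. Settings

1. File structure

I have a file test1.md. It consists of repeating blocks. Each block has 10 lines. The structure of the file (using PCRE regex):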

Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*

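test1.md contains no lines or text other than the 10-line blocks. It has no blank lines and no blocks with more or fewer than 10 lines.

2. Example of file content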

Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author

As the example shows, test1.md contains duplicated 7-line blocks. For example, these blocks are:

Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion

and

Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author

3. Expected behavior

I need to remove all duplicate blocks. For my example, I need to get:

Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E

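If 7 lines duplicate 7 lines that already occur earlier in my file, the duplicate 7 lines are removed.
If 1 line (or 2 to 4 lines) duplicates lines that already occur in my file, the duplicate is not removed. In the example, the words Sasha, Kazan, Chistopol and Katya are duplicated, but these words are not removed.

4. Did not help

Googling
I found that the Unix commands sort, sed and awk can solve similar tasks, but I did not find how to solve my task with these commands.

5. Do not offer

Please do not suggest manually deleting each text block. I may have several thousand different duplicated text blocks. Manually removing all the duplicates could take a lot of time.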

Answer 1

You can use Sublime Text's find and replace feature with the following regular expression:

Find What: \A(?1)*?((^.*$\n){5})(?1)*?\K\1+
Replace With:

(i.e. replace with nothing)

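This finds a block of 5 lines that occurs again later in the document and deletes the duplicate (second) occurrence of those 5 lines, plus any repeats of it that immediately follow. All other lines, i.e. the original 5 lines that were duplicated and everything else, are left untouched.

Unfortunately, due to the nature of the regex, you will need to perform this operation multiple times to remove all duplicates. It may be easier to keep invoking "Replace" than to use "Replace All" and have to re-open the panel each time. (Somehow the \K works as expected here, despite a report of it not working with "Replace".)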

Answer 2

Here is an awk + sed approach that meets your requirement.

$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}'
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
Another question.
Sasha
Kazan
Chistopol
Katya
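
For readers without GNU sed, the same idea can be sketched in Python. As far as I can tell, the pipeline above appends a blank line after every 5th input line (GNU sed's `0~5` address), so that awk's paragraph mode (`RS=`) reads each 5-line group as one record and prints it only the first time it is seen. The function name `dedupe_chunks` and the sample data are illustrative assumptions, not part of the original answer, and the sketch assumes the input is already split into lines.

```python
def dedupe_chunks(lines, size=5):
    """Return the lines with every repeated `size`-line chunk dropped.

    The input is cut into consecutive, non-overlapping chunks of `size`
    lines (the last chunk may be shorter). A chunk is kept only the first
    time its exact content appears.
    """
    seen = set()
    kept = []
    for start in range(0, len(lines), size):
        chunk = tuple(lines[start:start + size])
        if chunk not in seen:
            seen.add(chunk)
            kept.extend(chunk)
    return kept


# Hypothetical sample: three 2-line chunks, the third repeats the first.
sample = ["q1", "x", "q2", "y", "q1", "x"]
print(dedupe_chunks(sample, size=2))  # → ['q1', 'x', 'q2', 'y']
```

Like the pipeline, this only detects duplicates that start on a chunk boundary; a repeated run that is shifted by one line would not be found.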

Answer 3

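Please find below the code for Windows PowerShell. The code is not optimized in any way. Change test.txt in the code below to the name of your file, and make sure the working directory is the one that contains it. The output is a csv file, which you can open and sort in Excel; deleting the first column removes the indexes. I don't know why these indexes appear or how to get rid of them. This is my first attempt at using Windows PowerShell, and I could not find the syntax for declaring a fixed-size array of strings. Nevertheless, it works.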

$d=Get-Content test.txt
# $chk and $unique are hashtables keyed by integer index, used in place of arrays
$chk=@{}
$tot=$d.Count
$unique=@{}
$g=0
$isunique=1
# Walk the input in consecutive 5-line blocks
for($i=0;$i -lt $tot){
    $isunique=1
    $chk[0]=$d[$i]
    $chk[1]=$d[$i+1]
    $chk[2]=$d[$i+2]
    $chk[3]=$d[$i+3]
    $chk[4]=$d[$i+4]
    $i=$i+5
    # Compare the current block with every block kept so far
    for($j=0;$j -lt $unique.count){
        if($unique[$j] -eq $chk[0]){
            if($unique[$j+1] -eq $chk[1]){
                if($unique[$j+2] -eq $chk[2]){
                    if($unique[$j+3] -eq $chk[3]){
                        if($unique[$j+4] -eq $chk[4]){
                            $isunique=0
                            break
                        }
                    }
                }
            }
        }
        $j=$j+5
    }
    # Keep the block only if no identical block was stored earlier
    if ($isunique){
        $unique[$g]=$chk[0]
        $unique[$g+1]=$chk[1]
        $unique[$g+2]=$chk[2]
        $unique[$g+3]=$chk[3]
        $unique[$g+4]=$chk[4]
        $g=$g+5
    }
}
$unique | out-file test2.csv

Screenshot: http://imgur.com/a/ZP9T5

If you have experience with PowerShell, please optimize the code. I tried .Contains, .Add and so on, but did not get the desired result. Hope it helps.

Answer 4

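Your requirements aren't clear about how to handle overlapping blocks of 5 lines, how to handle blocks of fewer than 5 lines at the end of the input, and various other edge cases, so here is one way to identify the duplicated blocks of 5 (or fewer) lines: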

$ cat tst.awk
{
    for (i=1; i<=5; i++) {
        blockNr = NR - i + 1
        if ( blockNr > 0 ) {
            blocks[blockNr] = (blockNr in blocks ? blocks[blockNr] RS : "") $0
        }
    }
}
END {
    for (blockNr=1; blockNr in blocks; blockNr++) {
        block = blocks[blockNr]
        print "----------- Block", blockNr, (seen[block]++ ? "***** DUP *****" : "ORIG")
        print block
    }
}


$ awk -f tst.awk file
----------- Block 1 ORIG
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 2 ORIG
Sasha
Kristina
Katya
Valeria
Where Sasha live?
----------- Block 3 ORIG
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
----------- Block 4 ORIG
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
----------- Block 5 ORIG
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 6 ORIG
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 7 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
----------- Block 8 ORIG
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
----------- Block 9 ORIG
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
----------- Block 10 ORIG
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
----------- Block 11 ***** DUP *****
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 12 ORIG
Sasha
Kristina
Katya
Valeria
Another question.
----------- Block 13 ORIG
Kristina
Katya
Valeria
Another question.
Sasha
----------- Block 14 ORIG
Katya
Valeria
Another question.
Sasha
Kazan
----------- Block 15 ORIG
Valeria
Another question.
Sasha
Kazan
Chistopol
----------- Block 16 ORIG
Another question.
Sasha
Kazan
Chistopol
Katya
----------- Block 17 ORIG
Sasha
Kazan
Chistopol
Katya
Where Sasha live?
----------- Block 18 ORIG
Kazan
Chistopol
Katya
Where Sasha live?
St. Petersburg
----------- Block 19 ORIG
Chistopol
Katya
Where Sasha live?
St. Petersburg
Kazan
----------- Block 20 ORIG
Katya
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 21 ***** DUP *****
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 22 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 23 ORIG
Kazan
Novgorod
Chistopol
----------- Block 24 ORIG
Novgorod
Chistopol
----------- Block 25 ORIG
Chistopol

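You can build on this to:

print the lines from each ORIG block that have not already been printed, by using the block's blockNr plus the current line number within that block (hint: split(block,lines,RS)), and
figure out how to handle the requirements you did not specify.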
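
One possible way to take those two steps, sketched in Python rather than awk. The policy here is my own assumption, since the question leaves it open: a run of n lines is dropped only if the identical run also occurs entirely before it in the input, and the trailing lines that cannot form a full run are always kept. The name `remove_repeated_runs` and the sample data are hypothetical.

```python
def remove_repeated_runs(lines, n=5):
    """Drop every run of n consecutive lines that already occurred earlier."""
    # Start index of the first occurrence of each n-line window.
    first_seen = {}
    for start in range(len(lines) - n + 1):
        first_seen.setdefault(tuple(lines[start:start + n]), start)

    kept = []
    i = 0
    while i < len(lines):
        window = tuple(lines[i:i + n])
        earlier = first_seen.get(window)
        # A full window whose first occurrence ends at or before position i
        # is treated as a duplicate and skipped as a whole.
        if len(window) == n and earlier is not None and earlier + n <= i:
            i += n
        else:
            kept.append(lines[i])
            i += 1
    return kept


# Hypothetical sample: the 2-line run "a", "b" appears twice.
sample = ["h1", "a", "b", "h2", "a", "b", "h3", "c", "a"]
print(remove_repeated_runs(sample, n=2))
# → ['h1', 'a', 'b', 'h2', 'h3', 'c', 'a']
```

Unlike a fixed chunking approach, the window slides one line at a time, so duplicates are found wherever they start. The dictionary holds one entry per distinct window, which should be manageable for a file of around 14 MB but is worth keeping in mind for much larger inputs.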

Answer 5

Here is a simple solution (if you have access to GNU sed, sort and uniq):

sed 's/^Millionaire/\x0&/' file | sort -z -k4 | uniq -z -f3 | tr -d '\000'

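A little explanation:

because all your blocks begin with the word/line Millionaire, we can mark the block boundaries by prepending a NUL character to each Millionaire;
then we sort those NUL-delimited blocks (using the -z flag) while ignoring the first 3 fields (in this case lines: Millionaire, \d+, QUESTION|ID...), using the -k/--key option with the start position at field 4 (in your case, line 4) and the stop position at the end of the block;
after sorting, we can filter out the duplicates with uniq, again using the NUL delimiter instead of newline (-z) and ignoring the first 3 fields (with -f/--skip-fields);
finally, we remove the NUL characters with tr.

In general, a solution like this for removing duplicate blocks should work as long as there is a way to split the file into blocks. Note that block equality can be defined on a subset of fields (as we did above).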
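
The same block-wise idea can be sketched in Python for systems without the GNU tools. This is a minimal illustration under my own assumptions: a block starts at any line that is exactly `Millionaire`, the first 3 lines of a block are ignored when comparing, and a duplicate block is dropped as a whole, as uniq does in the pipeline. The function name `dedupe_blocks` and its parameters are hypothetical.

```python
def dedupe_blocks(lines, marker="Millionaire", skip=3):
    """Keep the first block for each distinct body (the lines after `skip`)."""
    blocks = []
    for line in lines:
        # A marker line opens a new block; lines before the first marker
        # are collected into a block of their own.
        if line == marker or not blocks:
            blocks.append([])
        blocks[-1].append(line)

    seen = set()
    kept = []
    for block in blocks:
        body = tuple(block[skip:])  # equality is defined on the body only
        if body not in seen:
            seen.add(body)
            kept.extend(block)
    return kept


# Hypothetical sample: two blocks with different headers but the same body.
sample = ["Millionaire", "1", "Q|a", "x", "Millionaire", "2", "Q|b", "x"]
print(dedupe_blocks(sample))  # → ['Millionaire', '1', 'Q|a', 'x']
```

Because nothing is sorted, the surviving blocks stay in their original order, which differs from the sort-based pipeline.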

Tags: sorting, awk, sed, duplicates, sublimetext