Search and replace string in a very big file

I prefer a shell command for getting the job done. I have a very, very big file -- about 2.8 GB -- and the content is JSON. Everything is on one line, and I've been told there are at least 1.5 million records in there.

I must prepare the file for consumption. Each record must be on its own line. Sample:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}

Or, with data like the following...

{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale@hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh@microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}

The end result should be:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Commands tried:

The commands I tried worked very well on small files, but not on the 2.8 GB file I have to manipulate. Sed quit midway through after ten minutes for no apparent reason, having done nothing. Awk errored with a segmentation fault (core dumped) after many hours. Perl's search and replace finished with an "Out of memory" error.

Any help/ideas would be great!

Additional information about my machine:

Try using } as the record separator, e.g. in Perl (-0175 sets $/ to the character with octal code 175, i.e. }):

perl -l -0175 -ne 'print $_, $/' < input
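
Applied to the first sample above, that would produce roughly the following -- note the brace-only fragments at the end, left over from the closing }} pair:

{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]}
,{"RecordId":"2",...}
,{"RecordId":"3",...}
,{"RecordId":"4",...}
,{"RecordId":"5",...}
 }
}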

You may need to glue back lines containing nothing but }.
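
That glue step can be one more streaming pass; a minimal sketch (the glue.pl name and the brace-only pattern are my own, not from the question):

#!/usr/bin/perl
# glue.pl -- merge lines consisting only of "}" back onto the previous line
my $prev;
while (my $line = <>) {
    chomp $line;
    if (defined $prev && $line =~ /^\s*\}+\s*$/) {
        $prev .= $line;                 # brace-only line: glue it back on
    } else {
        print "$prev\n" if defined $prev;
        $prev = $line;                  # start buffering the next line
    }
}
print "$prev\n" if defined $prev;       # flush the last buffered line

$ perl -l -0175 -ne 'print $_, $/' < input | perl glue.pl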

This avoids the memory problem by not treating the data as a single record, but it may go too far the other way performance-wise (processing a single character at a time). Note also that the built-in RT variable (the value of the current record separator) requires gawk:

$ cat j.awk
BEGIN { RS="[[:print:]]" }            # every printable character is its own record
RT == "{" { bal++ }                   # opening brace: one level deeper
RT == "}" { bal-- }                   # closing brace: one level back up
{ printf "%s", RT }                   # pass the character through unchanged
RT == "," && bal == 2 { print "" }    # comma at record depth: start a new line
END { print "" }                      # terminate the last line

$ gawk -f j.awk j.txt
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}

Since you have tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's somewhat off-topic, I believe jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in perl with a bit of programming.

Assuming content like the following (based on your sample):

{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}

You can easily reformat this to "prettify" it:

$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}

And we can dig in to the data to retrieve only the records you're interested in (regardless of what they're wrapped in):

$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}

This is much more readable, both by humans and by tools like awk which process content line-by-line. If you want to join your lines for processing per your question, the awk becomes much more simple:

$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{   "RecordId": "1",   "data": "data" }
{   "RecordId": "2",   "data": "data" }
{   "RecordId": "3",   "data": "data" }
{   "RecordId": "4",   "data": "data" }
{   "RecordId": "5",   "data": "data" }

Or, as @peak suggested in the comments, eliminate the awk portion of this entirely by using jq's -c (compact output) option:

$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}

Regarding perl: try setting the input record separator $/ to }, like this:

#!/usr/bin/perl
$/ = "},";          # read one "record" at a time, up to and including "},"
while (<>) {
    print "$_\n";   # re-emit the record followed by a newline
}

Or, as a one-liner:

$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat 
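
For example, on a toy input (note how the }, separator stays attached to the end of each record):

$ echo '{"a":[{"x":"1"},{"x":"2"},{"x":"3"}]}' | perl -e '$/="},";while(<>){print "$_\n"}'
{"a":[{"x":"1"},
{"x":"2"},
{"x":"3"}]}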

Using the sample data provided here (the one that begins with {"Accounts":{"Customer"...), the solution to this problem is to read the file and, while reading, count the delimiters defined in $/. For every 10,000 delimiters counted, it writes out a new file, and for each delimiter found it gives it a new line. The script looks like this:

#!/usr/bin/perl

$base="/home/dat789/incoming";

$/= "}]},";   # delimiter to find and insert a new line after
$n = 0;
$match="";
$filecount=0;
$recsPerFile=10000;   # set number of records in a file

print "Processing " . join(" ", @ARGV) . "\n";

while (<>){
   $match=$match.$_."\n";   # keep the record (separator included) plus a new line
   $n++;
   print ".";   # this is so that we'd know it has done something
   if ($n >= $recsPerFile) {
      my $newfile="partfile".$recsPerFile."-".$filecount.".dat";
      open ( OUTPUT, '>', $newfile ) or die "Cannot open $newfile: $!";
      print OUTPUT $match;
      close ( OUTPUT );
      $match="";
      $filecount++;
      $n=0;
      print "Wrote file " . $newfile . "\n";
   }
}

if ($match ne "") {   # flush whatever records remain into a final part file
   my $newfile="partfile".$recsPerFile."-".$filecount.".dat";
   open ( OUTPUT, '>', $newfile ) or die "Cannot open $newfile: $!";
   print OUTPUT $match;
   close ( OUTPUT );
   print "Wrote file " . $newfile . "\n";
}

print "Finished\n\n";

I've used this script against the big 2.8 GB file, where the content was unformatted one-liner JSON. The resulting output files lack the proper JSON headers and footers, but that is easy to fix.
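
As an illustration, a re-wrap pass could look something like this (a sketch only: the {"Accounts":{"Customer":[ wrapper is taken from the sample above, and the very last part file, which still carries the original closing brackets, would need slightly different handling):

for f in partfile10000-*.dat
do
  { printf '{"Accounts":{"Customer":[\n'    # restore the header
    sed '$s/,$//' "$f"                      # drop the comma after the last record
    printf ']}}\n'                          # restore the footer
  } > "${f%.dat}-fixed.json"
done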

Thank you so much, everyone, for contributing!