从散列和数组的散列中删除重复的键(并确保任何由此产生的空散列也被删除)

Remove duplicate keys from a hash of hashes and arrays (and ensure any resulting empty hashs are also removed)

我有一个 JSON 格式的数据流,我的脚本从内部网站访问该数据流。我的脚本使用 JSON.pm 将 JSON 转换为 perl 散列(我在 RHEL 6.9 上使用 perl 5.10.1)

在这个散列中有多个嵌套散列和嵌套数组,其中一些嵌套在大散列内部的其他 hashes/arrays 中。

我需要遍历哈希的整个结构,包括所有数组和嵌套哈希,并删除整个结构中任何位置的任何键,这些键与任何其他键共享相同的名称(仅针对特定键虽然名字)。

此外,由于数据的结构,一些嵌套的散列只有现在被删除的键,将一些键的值保留为空散列。我还需要删除那些其值为空散列的键

这是我转换为 perl 后的数据:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'husky' => {
                                                'name' => 'fred'
                                             },
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'husky' => 'wilma',
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles' # yes this is intentionally a scalar in the example
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                     {
                        'e' => {
                                  'husky' => 'dino'
                               },
                     },
                   ],
        }

我们要删除所有名为 'husky'

的键

它应该是这样的:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'labrador' => 'betty'
                               },
                        'c' => 'pebbles'
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                   ],
        }

这是我添加@Shawn 的代码并对其进行调整后得到的结果(这非常接近,但我们需要考虑空哈希值:

$VAR1 = {
          'cat' => 'meow',
          'dog' => [
                     {
                        'a' => {
                                  'chow' => {
                                               'name' => 'barney'
                                            }
                               },
                     },
                     {
                        'b' => {
                                  'lab' => 'betty'
                               },
                        'c' => 'pebbles' # yes this is intentionally a scalar in the example
                     },
                     {
                        'd' => {
                                  'shihtzu' => 'bambam'
                               },
                     },
                     {
                        'e' => {},
                     },
                   ]
        }

我尝试了在 SO 和 perlmonks 上其他地方发现的一些变体。 keys %$_ == 0!%$_ 等等。但是 none 似乎可以使用这个散列切片。

代码:

use 5.008008;
use strict;
use warnings;
use English; # I know I know, don't use English...
use JSON;
use YAML::Tiny qw(Dump);
# proprietary modules I wrote added here, which themselves load in LWP, HTTP::Cookie and others, and they do the bulk of building and sending the request. They are the back end to this script's front end.

[-snipped a ton of code-]

sub _count_keys
{
    my ($j, $seen) = @ARG;
    my $type = ref $j;
    if ($type eq "ARRAY")
    {
        for (@{$j})
        {
            _count_keys($ARG, $seen);
        }
    }
    elsif ($type eq "HASH")
    {
        while (my ($key, $val) = each %{$j})
        {
            $seen->{$key}++;
            if (ref $val)
            {
                _count_keys($val, $seen);
            }
        }
    }
    return $seen;
}

sub _remove_duplicate_keys
{
    my ($j, $seen) = @ARG;
    $seen //= _count_keys($j, {});

    my $type = ref $j;
    if ($type eq "ARRAY")
    {
        return [ map { _remove_duplicate_keys($ARG, $seen) } @{$j} ];
    }
    elsif ($type eq "HASH")
    {
        my %obj = %{$j};
        delete @obj{grep { $seen->{$ARG} > 1 and $ARG eq 'keyNameToBeExcluded'} keys %obj};
# Here is where I have been putting another delete line but I can't seem to find the right parameters for the grep to make it delete the empty anon hashes. Example of what I tried is the next comment below
#        delete @obj{grep { $seen->{$ARG} > 1 and keys $ARG{assetDetails} == 0 } keys %obj};

        while (my ($key, $val) = each %obj)
        {
            if (ref $val)
            {
                $obj{$key} = _remove_duplicate_keys($val, $seen);
            }
        }
        return \%obj;
    }
    else
    {
        return $j;
    }
}

sub _process_json
{
    my $JSONOUTPUT   = shift;
    my $OPTIONS      = shift;

    # Change true to 1 and false to 0 to prevent blessed objects from appearing in the JSON, which prevents the YAML::Tiny module from barfing
    foreach (@{$JSONOUTPUT})
    {
        s{true(,\n)}{1}gxms;
        s{false(,\n)}{0}gxms;
    }

    my $JSONPERLOBJ  = JSON->new->utf8->decode(@{$JSONOUTPUT});

# Test code below here; real code not in use while I test getting the output right.
use Data::Dumper;
my $BEFORE = $JSONPERLOBJ;
my $AFTER = _remove_duplicate_keys($JSONPERLOBJ);
#    $JSONPERLOBJ = _remove_duplicate_keys($JSONPERLOBJ);
#print Dumper $BEFORE;
print Dumper $AFTER;
exit 1;
# End test code
}
sub _main
{
    [-snip private code-]
    my @JSONOUTPUT = $RESPONSE->decoded_content;
    my $RC = _process_json(\@JSONOUTPUT, $OPTIONS);

    exit ($RC == 1)?0:1;
}

我认为这符合您的要求:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use JSON::XS; # Better than JSON; also see JSON::MaybeXS

my $j = <<EOJSON;
{
  "foo": 1,
  "bar": {
      "foo": true,
      "baz": false
      },
  "dog": "woof",
  "cat": [ { "foo": 3 } ]
}
EOJSON

sub count_keys {
  my ($j, $seen) = @_;
  my $type = ref $j;
  if ($type eq "ARRAY") {
    count_keys($_, $seen) for @$j;
  } elsif ($type eq "HASH") {
    while (my ($key, $val) = each %$j) {
      $seen->{$key}++;
      count_keys($val, $seen) if ref $val;
    }
  }
  return $seen;
}

sub remove_dups {
  my ($j, $seen) = @_;
  $seen //= count_keys($j, {});

  my $type = ref $j;
  if ($type eq "ARRAY") {
    return [ map { remove_dups($_, $seen) } @$j ];
  } elsif ($type eq "HASH") {
    my %obj = %$j;
    delete @obj{grep { $seen->{$_} > 1 } keys %obj};
    while (my ($key, $val) = each %obj) {
      $obj{$key} = remove_dups($val, $seen) if ref $val;
    }
    return \%obj;
  } else {
    return $j;
  }
}

my $parsed = decode_json $j;
my $printer = JSON::XS->new->pretty->canonical;
say "Before:";
print $printer->encode($parsed);
say "After:";
my $dedup = remove_dups $parsed;
print $printer->encode($dedup);

产生

Before:
{
   "bar" : {
      "baz" : false,
      "foo" : true
   },
   "cat" : [
      {
         "foo" : 3
      }
   ],
   "dog" : "woof",
   "foo" : 1
}
After:
{
   "bar" : {
      "baz" : false
   },
   "cat" : [
      {}
   ],
   "dog" : "woof"
}

编辑解释:

第一次在表示 json 值(不必是 json 对象)的 perl 数据结构上调用 remove_dups,它调用 count_keys 递归遍历结构并创建所有键的散列和每个键出现的次数。然后它再次递归地遍历结构,返回一个深副本,没有在原始文件中出现不止一次的键。

这条线是真正的魔法:

delete @obj{grep { $seen->{$_} > 1 } keys %obj};

它使用散列切片一次删除一堆键,grep位返回出现不止一次的键列表。 More information 在切片上。

我认为 可用于删除重复项,看起来不错。

后续问题是我们最终可能会得到空结构,这些结构也需要删除。但是也可能有只包含空结构等的结构,我认为所有这些都需要消失。

我使用了问题中的 desired-result-hashref(我从中删除了一个 name=>... 以便没有重复项)并添加了一些空洞的麻烦。

use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd pp);

my $hr = {
    'cat' => 'meow',
    'dog' => [
        { 'a' => { 'chow' =>  { 'name' => 'barney' } }  },
        { 'b' => { 'lab' => 'betty' }, 'c' => 'pebbles' },
        { 'd' => { 'shihtzu' => 'bambam' }              },
        {   # all of the following need to go, and this hashref
            'e' => { },  
            'f' => { noval => { } },
            'g' => [ { }, { nada => { } }, [ ] ],
        },
    ],  
};
dd $hr; say '';

for my $k (sort keys %$hr) {
    next_level($hr, $k, $hr->{$k}, 'key');
}

# Takes: data structure (reference), key/index at which it is found, 
# its value for it, and description string of which it is, 'key|idx'
sub next_level {
    my ($ds, $index, $val, $kind) = @_;
    my $type = ref $val;

    if ($type eq 'ARRAY') {
        for my $i (0..$#$val) {
            next_level(
                ( $kind eq 'key' ? $ds->{$index} : $ds->[$index] ),
                $i, $val->[$i], 'idx' 
            );
        }

        # Collect indices for and delete elements that are empty
        my @to_delete;
        for my $i (0..$#$val) {
            if ( (ref $val->[$i] eq 'HASH'  and not keys %{$val->[$i]}) or
                 (ref $val->[$i] eq 'ARRAY' and not      @{$val->[$i]})  )
            {
                say "No value/empty for index $i, record for deletion";
                push @to_delete, $i;
            }
        }
        if (@to_delete) { 
            my %ref_idx = map { $_ => 1 } @to_delete;
            @$val = @$val[ grep { not exists $ref_idx{$_} } 0..$#$val ];
        }
    }
    elsif ($type eq 'HASH') {
        for my $k (sort keys %{$val}) {
            my $ds_next_level = 
                ($kind eq 'key') ? $ds->{$index} : $ds->[$index];

            next_level( $ds_next_level, $k, $val->{$k}, 'key' );

            # Delete if empty 
            if ( (ref $val->{$k} eq 'HASH'  and not keys %{$val->{$k}}) or
                 (ref $val->{$k} eq 'ARRAY' and not      @{$val->{$k}})  )
            {
                say "No value/empty for key $k, delete";
                delete $ds_next_level->{$k}
            }
        }
    }
    #elsif (not $type) { say "A value: ", $val }
}
say ''; dd $hr;

这是复杂数据结构的正常递归遍历,但有一个转折点:为了能够删除组件,递归子还需要数据结构本身,其中键(在 hashref 中)或索引 (在 arrayref 中)找到它,它是两者中的哪一个,一个键或一个索引。

递归后,如果目标为空且在 hashref 中,则目标将被删除。首先扫描 arrayref 以查找所有空元素,然后通过覆盖 arrayref 来删除它们,数组切片排除仅包含空数据结构的元素的索引。

为了排除“坏”索引,使用参考散列来提高效率。使用 map 覆盖数组可能更快(参见 this post),或者如果切片允许特定(解释器)优化则可能不会更快。

输出

{
  cat => "meow",
  dog => [
           { a => { chow => { name => "barney" } } },
           { b => { lab => "betty" }, c => "pebbles" },
           { d => { shihtzu => "bambam" } },
           { e => {}, f => { noval => {} }, g => [{}, { nada => {} }, []] },
         ],
}

No value/empty for key e, delete
No value/empty for key noval, delete
No value/empty for key f, delete
No value/empty for key nada, delete
No value/empty for index 0, record for deletion
No value/empty for index 1, record for deletion
No value/empty for index 2, record for deletion
No value/empty for key g, delete
No value/empty for index 3, record for deletion

{
  cat => "meow",
  dog => [
           { a => { chow => { name => "barney" } } },
           { b => { lab => "betty" }, c => "pebbles" },
           { d => { shihtzu => "bambam" } },
         ],
}