从散列和数组的散列中删除重复的键(并确保任何由此产生的空散列也被删除)
Remove duplicate keys from a hash of hashes and arrays (and ensure any resulting empty hashs are also removed)
我有一个 JSON 格式的数据流,我的脚本从内部网站访问该数据流。我的脚本使用 JSON.pm 将 JSON 转换为 perl 散列(我在 RHEL 6.9 上使用 perl 5.10.1)
在这个散列中有多个嵌套散列和嵌套数组,其中一些嵌套在大散列内部的其他 hashes/arrays 中。
我需要遍历哈希的整个结构,包括所有数组和嵌套哈希,并删除整个结构中任何位置的任何键,这些键与任何其他键共享相同的名称(仅针对特定键虽然名字)。
此外,由于数据的结构,一些嵌套的散列只有现在被删除的键,将一些键的值保留为空散列。我还需要删除那些其值为空散列的键
这是我转换为 perl 后的数据:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'husky' => {
'name' => 'fred'
},
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'husky' => 'wilma',
'lab' => 'betty'
},
'c' => 'pebbles' # yes this is intentionally a scalar in the example
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
{
'e' => {
'husky' => 'dino'
},
},
],
}
我们要删除所有名为 'husky'
的键
它应该是这样的:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'labrador' => 'betty'
},
'c' => 'pebbles'
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
],
}
这是我添加@Shawn 的代码并对其进行调整后得到的结果(这非常接近,但我们需要考虑空哈希值:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'lab' => 'betty'
},
'c' => 'pebbles' # yes this is intentionally a scalar in the example
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
{
'e' => {},
},
]
}
我尝试了在 SO 和 perlmonks 上其他地方发现的一些变体。 keys %$_ == 0
、!%$_
等等。但是 none 似乎可以使用这个散列切片。
代码:
use 5.008008;
use strict;
use warnings;
use English; # I know I know, don't use English...
use JSON;
use YAML::Tiny qw(Dump);
# proprietary modules I wrote added here, which themselves load in LWP, HTTP::Cookie and others, and they do the bulk of building and sending the request. They are the back end to this script's front end.
[-snipped a ton of code-]
sub _count_keys
{
my ($j, $seen) = @ARG;
my $type = ref $j;
if ($type eq "ARRAY")
{
for (@{$j})
{
_count_keys($ARG, $seen);
}
}
elsif ($type eq "HASH")
{
while (my ($key, $val) = each %{$j})
{
$seen->{$key}++;
if (ref $val)
{
_count_keys($val, $seen);
}
}
}
return $seen;
}
sub _remove_duplicate_keys
{
my ($j, $seen) = @ARG;
$seen //= _count_keys($j, {});
my $type = ref $j;
if ($type eq "ARRAY")
{
return [ map { _remove_duplicate_keys($ARG, $seen) } @{$j} ];
}
elsif ($type eq "HASH")
{
my %obj = %{$j};
delete @obj{grep { $seen->{$ARG} > 1 and $ARG eq 'keyNameToBeExcluded'} keys %obj};
# Here is where I have been putting another delete line but I can't seem to find the right parameters for the grep to make it delete the empty anon hashes. Example of what I tried is the next comment below
# delete @obj{grep { $seen->{$ARG} > 1 and keys $ARG{assetDetails} == 0 } keys %obj};
while (my ($key, $val) = each %obj)
{
if (ref $val)
{
$obj{$key} = _remove_duplicate_keys($val, $seen);
}
}
return \%obj;
}
else
{
return $j;
}
}
sub _process_json
{
my $JSONOUTPUT = shift;
my $OPTIONS = shift;
# Change true to 1 and false to 0 to prevent blessed objects from appearing in the JSON, which prevents the YAML::Tiny module from barfing
foreach (@{$JSONOUTPUT})
{
s{true(,\n)}{1}gxms;
s{false(,\n)}{0}gxms;
}
my $JSONPERLOBJ = JSON->new->utf8->decode(@{$JSONOUTPUT});
# Test code below here; real code not in use while I test getting the output right.
use Data::Dumper;
my $BEFORE = $JSONPERLOBJ;
my $AFTER = _remove_duplicate_keys($JSONPERLOBJ);
# $JSONPERLOBJ = _remove_duplicate_keys($JSONPERLOBJ);
#print Dumper $BEFORE;
print Dumper $AFTER;
exit 1;
# End test code
}
sub _main
{
[-snip private code-]
my @JSONOUTPUT = $RESPONSE->decoded_content;
my $RC = _process_json(\@JSONOUTPUT, $OPTIONS);
exit ($RC == 1)?0:1;
}
我认为这符合您的要求:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use JSON::XS; # Better than JSON; also see JSON::MaybeXS
my $j = <<EOJSON;
{
"foo": 1,
"bar": {
"foo": true,
"baz": false
},
"dog": "woof",
"cat": [ { "foo": 3 } ]
}
EOJSON
sub count_keys {
my ($j, $seen) = @_;
my $type = ref $j;
if ($type eq "ARRAY") {
count_keys($_, $seen) for @$j;
} elsif ($type eq "HASH") {
while (my ($key, $val) = each %$j) {
$seen->{$key}++;
count_keys($val, $seen) if ref $val;
}
}
return $seen;
}
sub remove_dups {
my ($j, $seen) = @_;
$seen //= count_keys($j, {});
my $type = ref $j;
if ($type eq "ARRAY") {
return [ map { remove_dups($_, $seen) } @$j ];
} elsif ($type eq "HASH") {
my %obj = %$j;
delete @obj{grep { $seen->{$_} > 1 } keys %obj};
while (my ($key, $val) = each %obj) {
$obj{$key} = remove_dups($val, $seen) if ref $val;
}
return \%obj;
} else {
return $j;
}
}
my $parsed = decode_json $j;
my $printer = JSON::XS->new->pretty->canonical;
say "Before:";
print $printer->encode($parsed);
say "After:";
my $dedup = remove_dups $parsed;
print $printer->encode($dedup);
产生
Before:
{
"bar" : {
"baz" : false,
"foo" : true
},
"cat" : [
{
"foo" : 3
}
],
"dog" : "woof",
"foo" : 1
}
After:
{
"bar" : {
"baz" : false
},
"cat" : [
{}
],
"dog" : "woof"
}
编辑解释:
第一次在表示 json 值(不必是 json 对象)的 perl 数据结构上调用 remove_dups
,它调用 count_keys
递归遍历结构并创建所有键的散列和每个键出现的次数。然后它再次递归地遍历结构,返回一个深副本,没有在原始文件中出现不止一次的键。
这条线是真正的魔法:
delete @obj{grep { $seen->{$_} > 1 } keys %obj};
它使用散列切片一次删除一堆键,grep位返回出现不止一次的键列表。 More information 在切片上。
我认为 可用于删除重复项,看起来不错。
后续问题是我们最终可能会得到空结构,这些结构也需要删除。但是也可能有只包含空结构等的结构,我认为所有这些都需要消失。
我使用了问题中的 desired-result-hashref(我从中删除了一个 name=>...
以便没有重复项)并添加了一些空洞的麻烦。
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd pp);
my $hr = {
'cat' => 'meow',
'dog' => [
{ 'a' => { 'chow' => { 'name' => 'barney' } } },
{ 'b' => { 'lab' => 'betty' }, 'c' => 'pebbles' },
{ 'd' => { 'shihtzu' => 'bambam' } },
{ # all of the following need to go, and this hashref
'e' => { },
'f' => { noval => { } },
'g' => [ { }, { nada => { } }, [ ] ],
},
],
};
dd $hr; say '';
for my $k (sort keys %$hr) {
next_level($hr, $k, $hr->{$k}, 'key');
}
# Takes: data structure (reference), key/index at which it is found,
# its value for it, and description string of which it is, 'key|idx'
sub next_level {
my ($ds, $index, $val, $kind) = @_;
my $type = ref $val;
if ($type eq 'ARRAY') {
for my $i (0..$#$val) {
next_level(
( $kind eq 'key' ? $ds->{$index} : $ds->[$index] ),
$i, $val->[$i], 'idx'
);
}
# Collect indices for and delete elements that are empty
my @to_delete;
for my $i (0..$#$val) {
if ( (ref $val->[$i] eq 'HASH' and not keys %{$val->[$i]}) or
(ref $val->[$i] eq 'ARRAY' and not @{$val->[$i]}) )
{
say "No value/empty for index $i, record for deletion";
push @to_delete, $i;
}
}
if (@to_delete) {
my %ref_idx = map { $_ => 1 } @to_delete;
@$val = @$val[ grep { not exists $ref_idx{$_} } 0..$#$val ];
}
}
elsif ($type eq 'HASH') {
for my $k (sort keys %{$val}) {
my $ds_next_level =
($kind eq 'key') ? $ds->{$index} : $ds->[$index];
next_level( $ds_next_level, $k, $val->{$k}, 'key' );
# Delete if empty
if ( (ref $val->{$k} eq 'HASH' and not keys %{$val->{$k}}) or
(ref $val->{$k} eq 'ARRAY' and not @{$val->{$k}}) )
{
say "No value/empty for key $k, delete";
delete $ds_next_level->{$k}
}
}
}
#elsif (not $type) { say "A value: ", $val }
}
say ''; dd $hr;
这是复杂数据结构的正常递归遍历,但有一个转折点:为了能够删除组件,递归子还需要数据结构本身,其中键(在 hashref 中)或索引 (在 arrayref 中)找到它,它是两者中的哪一个,一个键或一个索引。
递归后,如果目标为空且在 hashref 中,则目标将被删除。首先扫描 arrayref 以查找所有空元素,然后通过覆盖 arrayref 来删除它们,数组切片排除仅包含空数据结构的元素的索引。
为了排除“坏”索引,使用参考散列来提高效率。使用 map
覆盖数组可能更快(参见 this post),或者如果切片允许特定(解释器)优化则可能不会更快。
输出
{
cat => "meow",
dog => [
{ a => { chow => { name => "barney" } } },
{ b => { lab => "betty" }, c => "pebbles" },
{ d => { shihtzu => "bambam" } },
{ e => {}, f => { noval => {} }, g => [{}, { nada => {} }, []] },
],
}
No value/empty for key e, delete
No value/empty for key noval, delete
No value/empty for key f, delete
No value/empty for key nada, delete
No value/empty for index 0, record for deletion
No value/empty for index 1, record for deletion
No value/empty for index 2, record for deletion
No value/empty for key g, delete
No value/empty for index 3, record for deletion
{
cat => "meow",
dog => [
{ a => { chow => { name => "barney" } } },
{ b => { lab => "betty" }, c => "pebbles" },
{ d => { shihtzu => "bambam" } },
],
}
我有一个 JSON 格式的数据流,我的脚本从内部网站访问该数据流。我的脚本使用 JSON.pm 将 JSON 转换为 perl 散列(我在 RHEL 6.9 上使用 perl 5.10.1)
在这个散列中有多个嵌套散列和嵌套数组,其中一些嵌套在大散列内部的其他 hashes/arrays 中。
我需要遍历哈希的整个结构,包括所有数组和嵌套哈希,并删除整个结构中任何位置的任何键,这些键与任何其他键共享相同的名称(仅针对特定键虽然名字)。
此外,由于数据的结构,一些嵌套的散列只有现在被删除的键,将一些键的值保留为空散列。我还需要删除那些其值为空散列的键
这是我转换为 perl 后的数据:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'husky' => {
'name' => 'fred'
},
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'husky' => 'wilma',
'lab' => 'betty'
},
'c' => 'pebbles' # yes this is intentionally a scalar in the example
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
{
'e' => {
'husky' => 'dino'
},
},
],
}
我们要删除所有名为 'husky'
的键它应该是这样的:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'labrador' => 'betty'
},
'c' => 'pebbles'
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
],
}
这是我添加@Shawn 的代码并对其进行调整后得到的结果(这非常接近,但我们需要考虑空哈希值:
$VAR1 = {
'cat' => 'meow',
'dog' => [
{
'a' => {
'chow' => {
'name' => 'barney'
}
},
},
{
'b' => {
'lab' => 'betty'
},
'c' => 'pebbles' # yes this is intentionally a scalar in the example
},
{
'd' => {
'shihtzu' => 'bambam'
},
},
{
'e' => {},
},
]
}
我尝试了在 SO 和 perlmonks 上其他地方发现的一些变体。 keys %$_ == 0
、!%$_
等等。但是 none 似乎可以使用这个散列切片。
代码:
use 5.008008;
use strict;
use warnings;
use English; # I know I know, don't use English...
use JSON;
use YAML::Tiny qw(Dump);
# proprietary modules I wrote added here, which themselves load in LWP, HTTP::Cookie and others, and they do the bulk of building and sending the request. They are the back end to this script's front end.
[-snipped a ton of code-]
sub _count_keys
{
my ($j, $seen) = @ARG;
my $type = ref $j;
if ($type eq "ARRAY")
{
for (@{$j})
{
_count_keys($ARG, $seen);
}
}
elsif ($type eq "HASH")
{
while (my ($key, $val) = each %{$j})
{
$seen->{$key}++;
if (ref $val)
{
_count_keys($val, $seen);
}
}
}
return $seen;
}
sub _remove_duplicate_keys
{
my ($j, $seen) = @ARG;
$seen //= _count_keys($j, {});
my $type = ref $j;
if ($type eq "ARRAY")
{
return [ map { _remove_duplicate_keys($ARG, $seen) } @{$j} ];
}
elsif ($type eq "HASH")
{
my %obj = %{$j};
delete @obj{grep { $seen->{$ARG} > 1 and $ARG eq 'keyNameToBeExcluded'} keys %obj};
# Here is where I have been putting another delete line but I can't seem to find the right parameters for the grep to make it delete the empty anon hashes. Example of what I tried is the next comment below
# delete @obj{grep { $seen->{$ARG} > 1 and keys $ARG{assetDetails} == 0 } keys %obj};
while (my ($key, $val) = each %obj)
{
if (ref $val)
{
$obj{$key} = _remove_duplicate_keys($val, $seen);
}
}
return \%obj;
}
else
{
return $j;
}
}
sub _process_json
{
my $JSONOUTPUT = shift;
my $OPTIONS = shift;
# Change true to 1 and false to 0 to prevent blessed objects from appearing in the JSON, which prevents the YAML::Tiny module from barfing
foreach (@{$JSONOUTPUT})
{
s{true(,\n)}{1}gxms;
s{false(,\n)}{0}gxms;
}
my $JSONPERLOBJ = JSON->new->utf8->decode(@{$JSONOUTPUT});
# Test code below here; real code not in use while I test getting the output right.
use Data::Dumper;
my $BEFORE = $JSONPERLOBJ;
my $AFTER = _remove_duplicate_keys($JSONPERLOBJ);
# $JSONPERLOBJ = _remove_duplicate_keys($JSONPERLOBJ);
#print Dumper $BEFORE;
print Dumper $AFTER;
exit 1;
# End test code
}
sub _main
{
[-snip private code-]
my @JSONOUTPUT = $RESPONSE->decoded_content;
my $RC = _process_json(\@JSONOUTPUT, $OPTIONS);
exit ($RC == 1)?0:1;
}
我认为这符合您的要求:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use JSON::XS; # Better than JSON; also see JSON::MaybeXS
my $j = <<EOJSON;
{
"foo": 1,
"bar": {
"foo": true,
"baz": false
},
"dog": "woof",
"cat": [ { "foo": 3 } ]
}
EOJSON
sub count_keys {
my ($j, $seen) = @_;
my $type = ref $j;
if ($type eq "ARRAY") {
count_keys($_, $seen) for @$j;
} elsif ($type eq "HASH") {
while (my ($key, $val) = each %$j) {
$seen->{$key}++;
count_keys($val, $seen) if ref $val;
}
}
return $seen;
}
sub remove_dups {
my ($j, $seen) = @_;
$seen //= count_keys($j, {});
my $type = ref $j;
if ($type eq "ARRAY") {
return [ map { remove_dups($_, $seen) } @$j ];
} elsif ($type eq "HASH") {
my %obj = %$j;
delete @obj{grep { $seen->{$_} > 1 } keys %obj};
while (my ($key, $val) = each %obj) {
$obj{$key} = remove_dups($val, $seen) if ref $val;
}
return \%obj;
} else {
return $j;
}
}
my $parsed = decode_json $j;
my $printer = JSON::XS->new->pretty->canonical;
say "Before:";
print $printer->encode($parsed);
say "After:";
my $dedup = remove_dups $parsed;
print $printer->encode($dedup);
产生
Before:
{
"bar" : {
"baz" : false,
"foo" : true
},
"cat" : [
{
"foo" : 3
}
],
"dog" : "woof",
"foo" : 1
}
After:
{
"bar" : {
"baz" : false
},
"cat" : [
{}
],
"dog" : "woof"
}
编辑解释:
第一次在表示 json 值(不必是 json 对象)的 perl 数据结构上调用 remove_dups
,它调用 count_keys
递归遍历结构并创建所有键的散列和每个键出现的次数。然后它再次递归地遍历结构,返回一个深副本,没有在原始文件中出现不止一次的键。
这条线是真正的魔法:
delete @obj{grep { $seen->{$_} > 1 } keys %obj};
它使用散列切片一次删除一堆键,grep位返回出现不止一次的键列表。 More information 在切片上。
我认为
后续问题是我们最终可能会得到空结构,这些结构也需要删除。但是也可能有只包含空结构等的结构,我认为所有这些都需要消失。
我使用了问题中的 desired-result-hashref(我从中删除了一个 name=>...
以便没有重复项)并添加了一些空洞的麻烦。
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd pp);
my $hr = {
'cat' => 'meow',
'dog' => [
{ 'a' => { 'chow' => { 'name' => 'barney' } } },
{ 'b' => { 'lab' => 'betty' }, 'c' => 'pebbles' },
{ 'd' => { 'shihtzu' => 'bambam' } },
{ # all of the following need to go, and this hashref
'e' => { },
'f' => { noval => { } },
'g' => [ { }, { nada => { } }, [ ] ],
},
],
};
dd $hr; say '';
for my $k (sort keys %$hr) {
next_level($hr, $k, $hr->{$k}, 'key');
}
# Takes: data structure (reference), key/index at which it is found,
# its value for it, and description string of which it is, 'key|idx'
sub next_level {
my ($ds, $index, $val, $kind) = @_;
my $type = ref $val;
if ($type eq 'ARRAY') {
for my $i (0..$#$val) {
next_level(
( $kind eq 'key' ? $ds->{$index} : $ds->[$index] ),
$i, $val->[$i], 'idx'
);
}
# Collect indices for and delete elements that are empty
my @to_delete;
for my $i (0..$#$val) {
if ( (ref $val->[$i] eq 'HASH' and not keys %{$val->[$i]}) or
(ref $val->[$i] eq 'ARRAY' and not @{$val->[$i]}) )
{
say "No value/empty for index $i, record for deletion";
push @to_delete, $i;
}
}
if (@to_delete) {
my %ref_idx = map { $_ => 1 } @to_delete;
@$val = @$val[ grep { not exists $ref_idx{$_} } 0..$#$val ];
}
}
elsif ($type eq 'HASH') {
for my $k (sort keys %{$val}) {
my $ds_next_level =
($kind eq 'key') ? $ds->{$index} : $ds->[$index];
next_level( $ds_next_level, $k, $val->{$k}, 'key' );
# Delete if empty
if ( (ref $val->{$k} eq 'HASH' and not keys %{$val->{$k}}) or
(ref $val->{$k} eq 'ARRAY' and not @{$val->{$k}}) )
{
say "No value/empty for key $k, delete";
delete $ds_next_level->{$k}
}
}
}
#elsif (not $type) { say "A value: ", $val }
}
say ''; dd $hr;
这是复杂数据结构的正常递归遍历,但有一个转折点:为了能够删除组件,递归子还需要数据结构本身,其中键(在 hashref 中)或索引 (在 arrayref 中)找到它,它是两者中的哪一个,一个键或一个索引。
递归后,如果目标为空且在 hashref 中,则目标将被删除。首先扫描 arrayref 以查找所有空元素,然后通过覆盖 arrayref 来删除它们,数组切片排除仅包含空数据结构的元素的索引。
为了排除“坏”索引,使用参考散列来提高效率。使用 map
覆盖数组可能更快(参见 this post),或者如果切片允许特定(解释器)优化则可能不会更快。
输出
{ cat => "meow", dog => [ { a => { chow => { name => "barney" } } }, { b => { lab => "betty" }, c => "pebbles" }, { d => { shihtzu => "bambam" } }, { e => {}, f => { noval => {} }, g => [{}, { nada => {} }, []] }, ], } No value/empty for key e, delete No value/empty for key noval, delete No value/empty for key f, delete No value/empty for key nada, delete No value/empty for index 0, record for deletion No value/empty for index 1, record for deletion No value/empty for index 2, record for deletion No value/empty for key g, delete No value/empty for index 3, record for deletion { cat => "meow", dog => [ { a => { chow => { name => "barney" } } }, { b => { lab => "betty" }, c => "pebbles" }, { d => { shihtzu => "bambam" } }, ], }