Ruby Nokogiri parsing: omitting duplicates
I am parsing an XML file and want to avoid adding duplicate values to my array. As it stands, the XML looks like this:
<vulnerable-software-list>
<product>cpe:/a:octopus:octopus_deploy:3.0.0</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.1</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.2</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.3</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.4</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.5</product>
<product>cpe:/a:octopus:octopus_deploy:3.0.6</product>
</vulnerable-software-list>
document.xpath("//entry[
  number(substring(translate(last-modified-datetime,'-.T:',''), 1, 12)) > #{last_imported_at} and
  cvss/base_metrics/access-vector = 'NETWORK'
]").each do |entry|
  product = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':')[-2] }
  effected_versions = entry.xpath('vulnerable-software-list/product').map { |product| product.content.split(':').last }
  puts product
end
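However, because of the XML input, this produces quite a few duplicates, so I end up with an array like ['Redhat','Redhat','Redhat','Fedora'].
effected_versions is already taken care of, since those values never repeat.
Is there a way to make .map add only unique values?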
If you need an array of unique values, just call uniq on the result of map:
product =
  entry.xpath('vulnerable-software-list/product').map do |product|
    product.content.split(':')[-2]
  end.uniq
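As a minimal sketch of the same idea without the XML layer, the snippet below runs the split-and-uniq step over plain CPE strings. The `cpe_strings` array and the `product_names` helper are hypothetical stand-ins for the `product.content` values that Nokogiri would return; they are not part of the original code.

```ruby
# Hypothetical stand-in for the text content of each <product> node.
cpe_strings = [
  'cpe:/a:octopus:octopus_deploy:3.0.0',
  'cpe:/a:octopus:octopus_deploy:3.0.1',
  'cpe:/a:octopus:octopus_deploy:3.0.2'
]

# Hypothetical helper: take the second-to-last CPE field (the product name)
# from every string, then drop repeated names while keeping first-seen order.
def product_names(cpes)
  cpes.map { |cpe| cpe.split(':')[-2] }.uniq
end

names = product_names(cpe_strings)

# Versions are the last CPE field; no uniq is applied here, mirroring the question.
versions = cpe_strings.map { |cpe| cpe.split(':').last }

puts names.inspect
puts versions.inspect
```

Note that uniq runs after map has built the full array, so map itself still visits every node; the duplicates are only removed from the result.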
There are several ways to do this:
require 'set' # needed for approach 3 on Ruby versions that do not autoload Set

input = ['Redhat','Redhat','Redhat','Fedora']

# approach 1
# self explanatory
result = input.uniq

# approach 2
# iterate through vals, and build a hash with the vals as keys
# since hashes cannot have duplicate keys, it provides a 'unique' check
result = input.each_with_object({}) { |val, memo| memo[val] = true }.keys

# approach 3
# Similar to the previous, we iterate through vals and add them to a Set.
# Adding a duplicate value to a set has no effect, and we can convert it to array
result = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a
If you are not familiar with each_with_object, it is very similar to reduce.
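To illustrate that similarity, here is a small sketch that builds the same Set with both methods. The variable names `with_object` and `with_reduce` are made up for this example.

```ruby
require 'set'

input = ['Redhat', 'Redhat', 'Redhat', 'Fedora']

# each_with_object: the block's return value is ignored, and the same
# accumulator object (here a Set) is passed to every iteration.
with_object = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a

# reduce: the block's return value becomes the next accumulator, and the
# block arguments arrive in the opposite order (accumulator first).
# This works here because Set#add returns the set itself.
with_reduce = input.reduce(Set.new) { |memo, val| memo.add(val) }.to_a

puts with_object.inspect
puts with_reduce.inspect
```

The practical difference is that with reduce the block must return the accumulator, whereas each_with_object always hands back the original object, which tends to suit mutable accumulators such as a Hash or a Set.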
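As for performance, you can search for more information, for example "What is the fastest way to make a uniq array?"
In a quick test I found that the approaches differ noticeably in running time. uniq was about 5 times faster than the each_with_object hash approach, which in turn was about 25% slower than the Set.new approach. That is probably because uniq is implemented in C. I only tested with arbitrary input, though, so this may not hold in every case.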
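If you want to check the timings on your own machine, a rough sketch using Ruby's standard Benchmark module could look like the following. The input size, the number of distinct values, and the repetition count are arbitrary assumptions, and the numbers will vary with Ruby version and data.

```ruby
require 'benchmark'
require 'set'

# Arbitrary test data (an assumption): 100,000 strings, 1,000 distinct values.
input = Array.new(100_000) { |i| "value#{i % 1_000}" }

# Sanity check first: all three approaches should yield the same array.
via_uniq = input.uniq
via_hash = input.each_with_object({}) { |val, memo| memo[val] = true }.keys
via_set  = input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a

# Time each approach over a few repetitions and print a small report.
Benchmark.bm(24) do |x|
  x.report('uniq') { 20.times { input.uniq } }
  x.report('each_with_object + Hash') do
    20.times { input.each_with_object({}) { |val, memo| memo[val] = true }.keys }
  end
  x.report('each_with_object + Set') do
    20.times { input.each_with_object(Set.new) { |val, memo| memo.add(val) }.to_a }
  end
end
```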