Finding a duplicate in an array -- how can I make it faster?
I wrote a function that returns true if an array contains any duplicates and false otherwise. My runtime is only in the 50th percentile of Leet Code submissions. Why? Isn't this O(n)? How can I make it faster?
def contains_duplicate(nums)
  hsh = Hash.new(0)
  nums.each do |num|
    hsh[num] += 1
    if hsh[num] > 1
      return true
    end
  end
  return false
end
Runtime of the hash submission: only in the 50th percentile.
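For reference, here is the expected behavior on a couple of sample inputs (the arrays are my own):

```ruby
def contains_duplicate(nums)
  hsh = Hash.new(0)
  nums.each do |num|
    hsh[num] += 1
    return true if hsh[num] > 1
  end
  false
end

p contains_duplicate([1, 2, 3, 1])  #=> true
p contains_duplicate([1, 2, 3, 4])  #=> false
```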
Edit:
Out of curiosity, here is a link to the coding problem on Leet Code: https://leetcode.com/problems/contains-duplicate/
I ran the suggested Set code and got an even worse runtime:
def contains_duplicate(nums)
  s = Set.new
  nums.each { |num| return true unless s.add?(num) }
  false
end
Runtime of the Set submission: only in the 20th percentile.
Fastest runtime:
def contains_duplicate(nums)
  hsh = Hash.new(0)
  count = 0
  nums.each do |num|
    count += 1
    hsh[num] = 1
    if hsh.size < count
      return true
    end
  end
  return false
end
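The reason this version works: assigning to a key that is already present does not grow the hash, so `hsh.size` falls behind the running count exactly when a repeat appears. A small illustration (the values are my own):

```ruby
hsh = {}
hsh[5] = 1
hsh[7] = 1
hsh[5] = 1   # duplicate key: the hash still holds only two entries
p hsh.size   #=> 2, although three assignments were made
```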
(Screenshot of the LeetCode runtimes: http://i.stack.imgur.com/Xx21p.png)
You could use a Set.
require 'set'
s = Set.new
nums.each { |num| return true unless s.add?(num) }
false
See Set#add?.
However, I wouldn't expect any significant difference from the OP's approach, since sets are implemented with hashes.
...but let's see.
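Set#add? adds the element and returns the set itself, or returns nil if the element was already a member, which is what makes the one-liner above work:

```ruby
require 'set'

s = Set.new
p s.add?(1)  # the element was new, so the set itself is returned (truthy)
p s.add?(1)  #=> nil, because 1 is already a member
```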
require 'fruity'
Here are the methods being compared:
def hash_way(nums)
  hsh = Hash.new(0)
  nums.each do |num|
    return true if hsh.key?(num)
    hsh[num] = 1
  end
  false
end
The above is the OP's code, slightly modified.
def set_way(nums)
  s = Set.new
  nums.each { |num| return true unless s.add?(num) }
  false
end
@gonzolo2000's method (since deleted) and @Jack's method, slightly modified:
def uniq_way(nums)
  nums.uniq.size < nums.size
end
def hash2_way(nums)
  hsh = Hash.new(0)
  count = 0
  nums.each do |num|
    count += 1
    hsh[num] = 1
    if hsh.size < count
      return true
    end
  end
  return false
end
def bench(nums, n)
  nums = arr(n)
  compare do
    _hash  { hash_way(nums) }
    _set   { set_way(nums) }
    _uniq  { uniq_way(nums) }
    _hash2 { hash2_way(nums) }
  end
end
First consider an array with a single duplicated element:
def arr(n)
  ((1..n).to_a << 1).shuffle
end
For example,
arr(20)
#=> [17, 12, 1, 20, 3, 10, 15, 9, 5, 2, 14, 1, 18, 16, 7, 13, 19, 4, 8, 11, 6]
bench(nums, 100)
Running each test 128 times. Test will take about 1 second.
_hash2 is similar to _hash
_hash is similar to _uniq
_uniq is similar to _set
bench(nums, 1_000)
Running each test 32 times. Test will take about 1 second.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 2x ± 1.0
bench(nums, 10_000)
Running each test 2 times. Test will take about 1 second.
_hash2 is similar to _hash
_hash is similar to _set
_set is similar to _uniq
bench(nums, 100_000)
Running each test once. Test will take about 2 seconds.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 7x ± 1.0
bench(nums, 1_000_000)
Running each test once. Test will take about 51 seconds.
_hash2 is similar to _hash
_hash is faster than _uniq by 10.000000000000009% ± 10.0%
_uniq is similar to _set
Now I'll change the test data so that 10% of the unique elements in the array have one duplicate:
def arr(n)
  (1..n).to_a.concat((1..n/10).to_a).shuffle
end
For example,
arr(30)
#=> [14, 3, 1, 5, 20, 11, 4, 2, 25, 15, 23, 18, 30, 2, 19, 10, 13,
# 26, 24, 8, 6, 21, 16, 27, 7, 17, 12, 1, 29, 3, 28, 9, 22]
bench(nums, 100)
Running each test 512 times. Test will take about 1 second.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 3x ± 1.0
bench(nums, 1_000)
Running each test 128 times. Test will take about 2 seconds.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 9x ± 1.0
bench(nums, 10_000)
Running each test 128 times. Test will take about 8 seconds.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 79x ± 10.0
bench(nums, 100_000)
Running each test 16 times. Test will take about 17 seconds.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 180x ± 10.0
bench(nums, 1_000_000)
Running each test 4 times. Test will take about 56 seconds.
_hash2 is similar to _hash
_hash is similar to _set
_set is faster than _uniq by 810x ± 100.0
I always like the idea of writing the obvious code. So my personal favorite would be something like this:
array.uniq == array
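Note that this expression is true when the array has *no* duplicates, i.e. it is the negation of the LeetCode predicate. For example:

```ruby
a = [1, 2, 3]
p a.uniq == a   #=> true, no duplicates
b = [1, 2, 1]
p b.uniq == b   #=> false, a duplicate exists
```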
And some benchmarks against your original method:
a = (1..20).to_a
Benchmark.realtime { 100000.times{ contains_duplicate(a) }}
=> 0.937844
Benchmark.realtime { 100000.times{ a.uniq == a }}
=> 0.804872
And with an array that contains duplicates:
a = a * 3
Benchmark.realtime { 100000.times{ contains_duplicate(a) }}
=> 1.068844
Benchmark.realtime { 100000.times{ a.uniq == a }}
=> 0.919273
I'm not familiar with Ruby, but I can see that your loop performs three hash lookups per item, and hash lookups are the most expensive operation involved, apart from allocating new items.
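To make the count concrete, here is the questioner's loop with the individual lookups called out (the annotations are mine):

```ruby
def contains_duplicate_annotated(nums)
  hsh = Hash.new(0)
  nums.each do |num|
    hsh[num] += 1    # lookup 1 (read the count) + lookup 2 (write it back)
    if hsh[num] > 1  # lookup 3 (re-read the value just written)
      return true
    end
  end
  false
end
```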
Try something like this, which needs only one lookup per item:
def contains_duplicate(nums)
  hsh = Hash.new(0)
  count = 0
  nums.each do |num|
    count += 1
    hsh[num] = 1
    if hsh.size < count
      return true
    end
  end
  return false
end