获取每个给定范围之间的整数个数

Get the number of integers between each given range

任务:

Write a function that receives 3 arrays and returns an array. The first array contains n integers, their values range between 0 and 10^9. "numbers". The second array is a low-range array, which contains the lower end of a range, it contains q integers. "low". The third array is a high-range array, which contains the higher end of a range, it contains q integers. "high".

The function should return an array that contains the number of integers in the first array, that fall in its range, given by the low-range and high-range arrays.

In the returned array, at index i, there should be the number of integers in "numbers" which are bigger or equal to low[i] and smaller or equal to high[i].

示例:
count_range([12,13,14,15,17],[14],[14]) 应该 return [1]
count_range([12,13,14,15,17],[14,15],[14,18]) 应该 return [1,2]
count_range([12,13,14,15,17],[12],[17]) 应该 return [5]

这就是我解决问题的方法,但我觉得可能有更有效的方法来解决这个问题,因为数组可能很长而且数字可能非常大。

我很高兴得到一些挑战这个的见解或测试可以帮助我朝着更好的方向思考。

def binarySearch(data, val):
    highIndex = len(data) - 1
    lowIndex = 0
    while highIndex > lowIndex:
        index = math.ceil((highIndex + lowIndex) / 2)
        sub = data[index]
        if sub > val:
            if highIndex == index:
                return sorted([highIndex, lowIndex])
            highIndex = index
        else:
            if lowIndex == index:
                return sorted([highIndex, lowIndex])
            lowIndex = index
    return sorted([highIndex, lowIndex])


def count_range(numbers, low, high):
    numbers.sort()
    result = []
    range_dict = {}
    for i in range(len(numbers)):
        if numbers[i] not in range_dict:
            range_dict[numbers[i]] = i
    for i in range(len(low)):
        low_r = low[i]
        high_r = high[i]
        if low_r not in range_dict:
            range_dict[low_r] = binarySearch(numbers, low_r)[0]
        low_index = range_dict.get(low_r)
        if high_r not in range_dict:
            range_dict[high_r] = binarySearch(numbers, high_r)[0]
        high_index = range_dict.get(high_r)
        while high_index+1 < len(numbers) and numbers[high_index + 1] == numbers[high_index]:
            high_index += 1
        if low_r in numbers or low_r < numbers[0]:
            low_index -= 1
        result.append(high_index - low_index)
    print(result)
    return result

使用 Python 的 bisect:

from bisect import bisect_left, bisect_right

def count_range(numbers, low, high):
    numbers.sort()
    return [bisect_right(numbers, hi) - bisect_left(numbers, lo)
            for hi, lo in zip(high, low)]

在我的解决方案中,排序实际上占用了我大约 90% 的时间,然后处理查询只占用了我大约 10% 的时间。

具有 100,000 个数字和 1000 个范围的基准 (Try it online!):

  71.6 ms  count_range_Sara
  24.1 ms  count_range_Kelly
6303.6 ms  count_range_Nin17

  70.5 ms  count_range_Sara
  23.1 ms  count_range_Kelly
6225.0 ms  count_range_Nin17

  64.5 ms  count_range_Sara
  23.1 ms  count_range_Kelly
6306.7 ms  count_range_Nin17

(注意:我测量的 Sara 的解决方案是原始的略有错误的解决方案,尽管存在错误但我还是将其包括在内,因为问题是关于效率的。我测量的 Nin17 的解决方案是他们现在已删除的答案中的原始解决方案,而不是NumPy 一个。)

使用 numpy:

import numpy as np
def count_range(numbers, low, high):
    numbers.sort()
    return np.searchsorted(numbers, high, 'right') - np.searchsorted(numbers, low, 'left')

基准:

from timeit import timeit
def test_data():
    RANGE = range(10**9)
    numbers = random.choices(RANGE, k=10**5)
    ranges = (sorted(random.sample(RANGE, 2))
              for _ in range(1000))
    [*low], [*high] = zip(*ranges)
    return [numbers, low, high]
test = test_data()
%timeit kelly(*[t.copy() for t in test])
test2 = [np.array(i) for i in test]
%timeit Nin17(*[t.copy() for t in test2])

输出:

13.4 ms ± 77.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.63 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)