Web scraping: Expand/contract bounding box depending on results

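A client wants to know where their competitor's stores are located, so I am being slightly evil and scraping the competitor's website.

The server accepts a bounding box (i.e. lower-left and upper-right coordinates) as a parameter and returns the locations found within that bounding box. This part works fine: I can successfully retrieve store locations given a bounding box.

The problem is that only the first 10 locations within the bounding box are returned, so in densely populated areas a 10-degree bounding box will contain far more locations than the server returns.

I could always use a smaller bounding box, but I am trying to avoid hitting the server unnecessarily while still making sure that all stores are returned.

So I need a way to reduce the size of the search rectangle when 10 stores are found (since there may be more than 10 stores), search recursively with the smaller rectangle size, and then revert to the larger rectangle for the next grid cell.

I have already written the function that retrieves stores from the server given a bounding box: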

stores = checkForStores(<bounding box>)
if len(stores) >= 10:
  # There are too many stores. Search again with a smaller bounding box
else:
  # Everything is good - process these stores

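But I am struggling with how to set the appropriate bounding boxes for the checkForStores function.

I have tried setting up the main grid cells using for loops over latitude and longitude: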

cellsize = 10
for minLat in range(-40, -10, cellsize):
    for minLng in range(110, 150, cellsize):
        maxLat = minLat + cellsize
        maxLng = minLng + cellsize

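...but I don't know how to continue the search with a smaller bounding box if 10 stores are found. I also tried using while loops, but I couldn't get either approach to work.

Thanks for any suggestions or pointers on where to start.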

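Here is an approach that uses recursion. The code should be self-explanatory, but this is how it works: given a bounding box, it checks the number of stores in it. If that number is greater than or equal to 10, it splits the box into smaller ones and calls itself with each new bounding box. It keeps doing this until a box yields fewer than 10 stores, in which case the stores found are simply saved in a list.

Note: since recursion is used, the maximum recursion depth could in theory be exceeded. In practice this is unlikely. In your case, even if you pass a 40 000 x 40 000 km bounding box, it takes only about 15 steps to reach a bounding box of roughly 1 x 1 km with cell_axis_reduction_factor=2: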

In [1]: import math

In [2]: math.log(40000, 2)
Out[2]: 15.287712379549449

In any case, if that does happen, you can try increasing the cell_axis_reduction_factor value.
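As a rough check of that estimate, the sketch below counts how many times a side length has to be divided before it drops to a target size, for a few values of cell_axis_reduction_factor. The helper name splits_to_reach is made up for this illustration; the 40 000 km and 1 km figures are the ones from the note above.

```python
import sys


def splits_to_reach(start_km, target_km, cell_axis_reduction_factor=2):
    """Count how many times a side of `start_km` must be divided by the
    factor before it is no longer than `target_km`.

    Hypothetical helper, only meant to illustrate the depth estimate.
    """
    steps = 0
    size = float(start_km)
    while size > target_km:
        size /= cell_axis_reduction_factor
        steps += 1
    return steps


for factor in (2, 4, 10):
    print('factor {}: {} splits'.format(
        factor, splits_to_reach(40000, 1, factor)))

# The interpreter's current recursion limit, for comparison.
print('recursion limit: {}'.format(sys.getrecursionlimit()))
```

With a factor of 2 this reports 16 splits to get to 1 km or below (15 splits leave about 1.2 km), and 8 and 5 splits for factors of 4 and 10. All of these are far below CPython's default recursion limit, which is usually 1000.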

Also note: in Python, according to PEP 8, function names should be lowercase with words separated by underscores, so I renamed the checkForStores function to check_for_stores.

# Save visited boxes. For debugging purposes only.
visited_boxes = []


def check_for_stores(bounding_box):
    """Mock of the real `check_for_stores` function that returns
    a random-length list of "stores".
    """
    import random
    store_count = random.randint(1, 12)
    print('Found {} stores for bounding box {}.'.format(
        store_count, bounding_box))
    visited_boxes.append(bounding_box)
    return ['store'] * store_count


def split_bounding_box(bounding_box, cell_axis_reduction_factor=2):
    """Returns a generator of bounding boxes obtained by splitting
    the parent `bounding_box`.

    :param bounding_box: tuple of two tuples holding the lower-left and
          upper-right corner coordinates,
          e.g. ((0, 5.2), (20.5, 14.0))
    :param cell_axis_reduction_factor: number of parts each axis is
                                       divided into, so the generator
                                       yields
                                       `cell_axis_reduction_factor`**2 boxes
    :return: generator of bounding box coordinates

    """
    box_lc, box_rc = bounding_box
    box_lc_x, box_lc_y = box_lc
    box_rc_x, box_rc_y = box_rc

    cell_width = (box_rc_x - box_lc_x) / float(cell_axis_reduction_factor)
    cell_height = (box_rc_y - box_lc_y) / float(cell_axis_reduction_factor)

    for x_factor in range(cell_axis_reduction_factor):
        lc_x = box_lc_x + cell_width * x_factor
        rc_x = lc_x + cell_width

        for y_factor in range(cell_axis_reduction_factor):
            lc_y = box_lc_y + cell_height * y_factor
            rc_y = lc_y + cell_height

            yield ((lc_x, lc_y), (rc_x, rc_y))


def get_stores_in_box(bounding_box, result=None):
    """Returns the list of stores found in the provided `bounding_box`.

    If 10 or more stores are found in `bounding_box`, recursively splits
    the current `bounding_box` into smaller ones and checks for stores
    in each of them.

    :param bounding_box: tuple of two tuples holding the lower-left and
          upper-right corner coordinates,
          e.g. ((0, 5.2), (20.5, 14.0))
    :param result: list to which found stores are appended;
                   used for recursive calls
    :return: list with found stores

    """
    if result is None:
        result = []

    print('Checking for stores...')
    stores = check_for_stores(bounding_box)
    if len(stores) >= 10:
        # The server caps results at 10, so this box may hold more stores.
        print('Stores number is more than or equal 10. Splitting bounding box...')
        for sub_box in split_bounding_box(bounding_box):
            get_stores_in_box(sub_box, result)
    else:
        print('Stores number is less than 10. Saving results.')
        result += stores

    return result


stores = get_stores_in_box(((0, 1), (30, 20)))
print('Found {} stores in total'.format(len(stores)))
print('Visited boxes: ')
print(visited_boxes)

Here is a sample output:

Checking for stores...
Found 10 stores for bounding box ((0, 1), (30, 20)).
Stores number is more than or equal 10. Splitting bounding box...
Checking for stores...
Found 4 stores for bounding box ((0.0, 1.0), (15.0, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 4 stores for bounding box ((0.0, 10.5), (15.0, 20.0)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 10 stores for bounding box ((15.0, 1.0), (30.0, 10.5)).
Stores number is more than or equal 10. Splitting bounding box...
Checking for stores...
Found 1 stores for bounding box ((15.0, 1.0), (22.5, 5.75)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 9 stores for bounding box ((15.0, 5.75), (22.5, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 4 stores for bounding box ((22.5, 1.0), (30.0, 5.75)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 1 stores for bounding box ((22.5, 5.75), (30.0, 10.5)).
Stores number is less than 10. Saving results.
Checking for stores...
Found 6 stores for bounding box ((15.0, 10.5), (30.0, 20.0)).
Stores number is less than 10. Saving results.
Found 29 stores in total
Visited boxes: 
[
((0, 1), (30, 20)), 
((0.0, 1.0), (15.0, 10.5)), 
((0.0, 10.5), (15.0, 20.0)), 
((15.0, 1.0), (30.0, 10.5)), 
((15.0, 1.0), (22.5, 5.75)), 
((15.0, 5.75), (22.5, 10.5)), 
((22.5, 1.0), (30.0, 5.75)), 
((22.5, 5.75), (30.0, 10.5)), 
((15.0, 10.5), (30.0, 20.0))
]
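Because the mock above returns random counts, it cannot show that every store is eventually collected. The following is a minimal, deterministic sketch for checking that, under several assumptions that are not part of the original post: the store data set ALL_STORES is invented, x is taken to be longitude and y latitude, the mock server matches stores on inclusive box edges, and an explicit stack with a max_depth guard replaces the recursive calls. It also shows one way the outer grid loop from the question could drive the search.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: 200 fake stores as (lng, lat) points inside the
# question's search area (lng 110..150, lat -40..-10).
ALL_STORES = [(random.uniform(110, 150), random.uniform(-40, -10))
              for _ in range(200)]
LIMIT = 10  # assumed cap on the number of stores returned per request
request_count = 0


def check_for_stores(bounding_box):
    """Mock server: return at most LIMIT stores inside the box."""
    global request_count
    request_count += 1
    (min_x, min_y), (max_x, max_y) = bounding_box
    inside = [s for s in ALL_STORES
              if min_x <= s[0] <= max_x and min_y <= s[1] <= max_y]
    return inside[:LIMIT]


def split_bounding_box(bounding_box, factor=2):
    """Yield factor**2 sub-boxes that share exact edges with each other
    and with the parent box, so no gaps appear from rounding."""
    (min_x, min_y), (max_x, max_y) = bounding_box
    width = (max_x - min_x) / float(factor)
    height = (max_y - min_y) / float(factor)
    xs = [min_x + width * i for i in range(factor)] + [max_x]
    ys = [min_y + height * j for j in range(factor)] + [max_y]
    for i in range(factor):
        for j in range(factor):
            yield ((xs[i], ys[j]), (xs[i + 1], ys[j + 1]))


def get_stores_in_box(bounding_box, max_depth=30):
    """Collect all stores in the box using an explicit stack.

    `max_depth` is a hypothetical safety guard: it stops the splitting
    if LIMIT or more stores ever share (almost) the same coordinates.
    """
    found = set()  # a set drops stores reported by two adjacent boxes
    stack = [(bounding_box, 0)]
    while stack:
        box, depth = stack.pop()
        stores = check_for_stores(box)
        if len(stores) >= LIMIT and depth < max_depth:
            stack.extend((sub, depth + 1)
                         for sub in split_bounding_box(box))
        else:
            found.update(stores)
    return found


# Outer grid from the question: 10-degree cells, each refined on demand.
cellsize = 10
all_found = set()
for min_lat in range(-40, -10, cellsize):
    for min_lng in range(110, 150, cellsize):
        box = ((min_lng, min_lat),
               (min_lng + cellsize, min_lat + cellsize))
        all_found |= get_stores_in_box(box)

print('Found {} of {} stores using {} requests'.format(
    len(all_found), len(ALL_STORES), request_count))
```

Since the mock treats box edges as inclusive, a store lying exactly on a shared edge can be returned by two neighbouring boxes, which is why the results are kept in a set. With real data the same idea would need some unique key per store, such as an ID or address, which is an assumption about what the server returns.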