Any alternatives to speeding up brute force 'tally' algorithm?
Apologies in advance if this is the wrong place to post this question; if there is a better Stack Exchange site for it, please let me know.
I'm currently developing a crime prediction algorithm that essentially lays a grid over a city and, for each grid cell, predicts whether it will become a hotspot (at least one assault crime) within the next 30 days.
I'm currently working with the city of Nashville, which is covered by a grid of 3,446 cells. I have a grid dataset containing everything needed to display the grid, each cell's map coordinates, and its neighboring cells (bottom neighbor, right neighbor, etc.).
Here is an example of the prediction results:
Here green indicates a correct prediction, red a false negative, and purple a false positive from the machine learning algorithm.
To train my neural network I use feature sets like the one shown below:
Hotspot is the target value (1 or 0). Week, month, and year are crime tallies drawn from the past year's incidents (crimes that occurred in the last week, last month, and last year). My problem is that building these feature sets takes a very long time (the script runs for more than 6 hours):
# Loop through each grid cell in the dataset
for grid_index, grid_row in grid.iterrows():
    print("On grid number: ", grid_row['id'])
    # Reset the tallies for this grid cell
    near = 0
    countWeek = 0
    countMonth = 0
    countYear = 0
    # Loop through all of the crimes
    for crime_index, crime_row in crime.iterrows():
        # Parse out the month, day, and year
        date = crime_row['Incident Occurred']
        date_pars = date.split('/')
        month = int(date_pars[0])
        day = int(date_pars[1])
        year = int(date_pars[2].split(' ')[0])
        # Tally crimes that fall in one of the eight neighboring cells
        if grid_row['top '] == crime_row['grid']:
            near += 1
        if grid_row['bottom '] == crime_row['grid']:
            near += 1
        if grid_row['left '] == crime_row['grid']:
            near += 1
        if grid_row['right '] == crime_row['grid']:
            near += 1
        if grid_row['topleft'] == crime_row['grid']:
            near += 1
        if grid_row['topright'] == crime_row['grid']:
            near += 1
        if grid_row['bottomright'] == crime_row['grid']:
            near += 1
        if grid_row['bottomleft'] == crime_row['grid']:
            near += 1
        # Tally crimes in this cell for the past month, week, and year
        if month == 12 and grid_row['id'] == crime_row['grid']:
            countMonth = countMonth + 1
        if day >= 25 and month == 12 and grid_row['id'] == crime_row['grid']:
            countWeek = countWeek + 1
        if year == 2017 and grid_row['id'] == crime_row['grid']:
            countYear = countYear + 1
    # Update the output for this specific grid cell
    output = output.append({'Grid': grid_row['id'], 'Hotspot': 0, 'week': countWeek,
                            'month': countMonth, 'year': countYear, 'near': near},
                           ignore_index=True)
Right now this code loops over every grid cell (3,446 in total) and, within each cell, loops over every crime (roughly 18,000), tallies the counts, and appends them to a pandas dataframe... 3,446 × 18,000 is about 62 million iterations to build this dataset. I didn't expect that to take too long, but it takes far longer than is practical.
Any ideas on how to speed this up effectively? I need to run this algorithm for every month of the past three years, so 36 runs at over 5 hours each is too long for my time constraints.
Thanks in advance for any insight.
Edit: To clarify, 'grid_row' is each record in the grid CSV file with the columns I posted above (each cell's location and its neighboring cells), and 'crime_row' is each crime incident that occurred within the past year:
The way you are doing it boils down to:
forall grid
    forall crimes
        if crime.cell == grid.cell
            do something
That has complexity O(|grid| * |crimes|).
If you had 3k crimes and 5k grid cells, that would be 15e6 iterations.
A better approach is to iterate over the crimes once and push each crime onto its associated grid cell, stacking all crimes that share the same grid_index in the same place:
gridIdxToCrimes = {}  # map each grid index to the list of crimes in that cell
for crime_index, crime_row in crime.iterrows():
    grid_index = crime_row['grid']
    if grid_index not in gridIdxToCrimes:
        gridIdxToCrimes[grid_index] = []
    gridIdxToCrimes[grid_index].append(crime_row)

for grid_index, grid_row in grid.iterrows():
    near = 0
    topIndex = grid_row['top ']
    if topIndex in gridIdxToCrimes:
        # all the crimes in the cell directly above the current one
        near += len(gridIdxToCrimes[topIndex])
This way you do O(|crimes| + |grid|) work, roughly 8k iterations (3k + 5k) instead of 15e6.
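Applied to the actual columns from the question, the same idea can be written with per-cell counts so that the inner loop over all ~18,000 crimes disappears entirely. Below is a minimal sketch, assuming the column names and hard-coded date windows from the original script ('Incident Occurred' parsed as month/day/year, month 12, day >= 25, year 2017); the 'when' column, 'neighbor_cols' list, and 'rows' variable are just illustrative names:

import pandas as pd

# Parse every date once, instead of once per (grid, crime) pair.
crime['when'] = pd.to_datetime(crime['Incident Occurred'], errors='coerce')

# Count crimes per grid cell, overall and for each time window (same hard-coded
# windows as the original script: December, December 25 onward, and the year 2017).
total_per_cell = crime['grid'].value_counts()
month_per_cell = crime.loc[crime['when'].dt.month == 12, 'grid'].value_counts()
week_per_cell = crime.loc[(crime['when'].dt.month == 12) & (crime['when'].dt.day >= 25),
                          'grid'].value_counts()
year_per_cell = crime.loc[crime['when'].dt.year == 2017, 'grid'].value_counts()

neighbor_cols = ['top ', 'bottom ', 'left ', 'right ',
                 'topleft', 'topright', 'bottomright', 'bottomleft']

rows = []
for _, grid_row in grid.iterrows():
    cell = grid_row['id']
    # Crimes in the eight neighboring cells: one lookup per neighbor.
    near = sum(int(total_per_cell.get(grid_row[col], 0)) for col in neighbor_cols)
    rows.append({'Grid': cell,
                 'Hotspot': 0,
                 'week': int(week_per_cell.get(cell, 0)),
                 'month': int(month_per_cell.get(cell, 0)),
                 'year': int(year_per_cell.get(cell, 0)),
                 'near': near})

# Build the output dataframe once, rather than appending row by row.
output = pd.DataFrame(rows)

With this, each grid cell costs only a handful of dictionary-style lookups, so the whole pass should be dominated by the single sweep over the ~18,000 crimes rather than by 62 million comparisons.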