Fastest way to rank trips between two coordinates based on intervening opportunities?
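I have a dataframe with over 5000 places (lat/lon coordinates) and another dataframe with over 23000 trips, each with origin lat/lon and destination lat/lon coordinates.
All the places and trips are in the area of Prague, Czech Republic.
I want to calculate the rank of each trip based on intervening opportunities, that is, on the sum of all other places that are closer to the origin. The direction of the opportunities doesn't matter.
I tried nested loops to build a list of all distances between places, but that was too slow (it got to the 80th place in 10 hours):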
for (row in 73:nrow(dataset_2015_POI_prg)) {
  print(row)
  id <- toString(dataset_2015_POI_prg[row, "venue_id"])
  lat <- dataset_2015_POI_prg[row, "venue_lat"]
  lon <- dataset_2015_POI_prg[row, "venue_lon"]
  for (innerrow in 1:nrow(dataset_2015_POI_prg)) {
    innerid <- toString(dataset_2015_POI_prg[innerrow, "venue_id"])
    if (id != innerid && length(which(dataset_2015_POI_mix$from_venue_id == innerid & dataset_2015_POI_mix$to_venue_id == id)) == 0) {
      print(innerrow)
      innerlat <- dataset_2015_POI_prg[innerrow, "venue_lat"]
      innerlon <- dataset_2015_POI_prg[innerrow, "venue_lon"]
      dist <- distm(c(lon, lat), c(innerlon, innerlat), fun = distHaversine)
      dataset_2015_POI_mix[nrow(dataset_2015_POI_mix) + 1,] = list(id, lat, lon, innerid, innerlat, innerlon, as.numeric(dist))
    }
  }
}
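Trips dataframe:
user_id  from_lat  from_lon  to_lat    to_lon    distance
159493   50.08017  14.50109  50.09171  14.54276  3241.884096
159493   50.09171  14.54276  50.09076  14.54271  106.390784
159493   50.09076  14.54271  50.11302  14.61078  5456.33700
...
Places dataframe:
venue_id  venue_lat  venue_lon
4adcda9   50.08096   14.42810
...
What is the correct and fastest way to do that? The expected result is a new dataframe of trips with a new column rank, which is the sum of all places that are closer to the origin than the destination is.
Thanks a lot, I'm new to R :)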
EDIT:
The source files are too big for pastebin, so here they are:
Trips: http://data.krysp.in/trips.txt
Places: http://data.krysp.in/pois.txt
EDIT2:
dput() of a smaller sample of the data
Places:
structure(list(venue_lat = c(50.09171, 50.090755, 50.113024,
50.113251, 50.103708, 50.080167, 50.108774, 50.113106, 50.081854,
50.104832, 50.090597, 50.113026, 50.068476, 50.113124, 50.10815,
50.060503), venue_lon = c(14.542765, 14.542707, 14.610781, 14.611714,
14.490623, 14.501095, 14.577527, 14.611648, 14.500505, 14.476009,
14.541811, 14.611271, 14.404627, 14.611779, 14.583479, 14.506008
)), row.names = c(NA, 16L), class = "data.frame")
Trips:
structure(list(user_id = c(159493, 159493, 159493, 159493, 159493,
159493, 159493, 159493, 159493, 159493, 159493, 159493, 159493,
159493, 159493, 159493, 159493, 159493, 159493, 159493, 159493,
159493, 159493, 159493, 159493, 159493, 159493, 159493, 159493,
159493), from_lat = c(50.080167, 50.09171, 50.090755, 50.113024,
50.113251, 50.113024, 50.103708, 50.080167, 50.108774, 50.113024,
50.113106, 50.09171, 50.080167, 50.081854, 50.113106, 50.113024,
50.104832, 50.09171, 50.090597, 50.113024, 50.09171, 50.113026,
50.113024, 50.068476, 50.113124, 50.113024, 50.09171, 50.113024,
50.10815, 50.09171), from_lon = c(14.501095, 14.542765, 14.542707,
14.610781, 14.611714, 14.610781, 14.490623, 14.501095, 14.577527,
14.610781, 14.611648, 14.542765, 14.501095, 14.500505, 14.611648,
14.610781, 14.476009, 14.542765, 14.541811, 14.610781, 14.542765,
14.611271, 14.610781, 14.404627, 14.611779, 14.610781, 14.542765,
14.610781, 14.583479, 14.542765), from_timestamp = c(10284, 58919,
58960, 82576, 197020, 1520404, 1539221, 1581079, 1585186, 1586688,
1586730, 1615656, 1637753, 1640134, 1643362, 1643399, 1659750,
1756952, 1765592, 1870541, 2000993, 2008701, 2008728, 2541997,
2653448, 2659355, 2682234, 2727528, 2822921, 2852025), to_lat = c(50.09171,
50.090755, 50.113024, 50.113251, 50.113024, 50.103708, 50.080167,
50.108774, 50.113024, 50.113106, 50.09171, 50.080167, 50.081854,
50.113106, 50.113024, 50.104832, 50.09171, 50.090597, 50.113024,
50.09171, 50.113026, 50.113024, 50.068476, 50.113124, 50.113024,
50.09171, 50.113024, 50.10815, 50.09171, 50.060503), to_lon = c(14.542765,
14.542707, 14.610781, 14.611714, 14.610781, 14.490623, 14.501095,
14.577527, 14.610781, 14.611648, 14.542765, 14.501095, 14.500505,
14.611648, 14.610781, 14.476009, 14.542765, 14.541811, 14.610781,
14.542765, 14.611271, 14.610781, 14.404627, 14.611779, 14.610781,
14.542765, 14.610781, 14.583479, 14.542765, 14.506008), to_timestamp = c(58919,
58960, 82576, 197020, 1520404, 1539221, 1581079, 1585186, 1586688,
1586730, 1615656, 1637753, 1640134, 1643362, 1643399, 1659750,
1756952, 1765592, 1870541, 2000993, 2008701, 2008728, 2541997,
2653448, 2659355, 2682234, 2727528, 2822921, 2852025, 3185844
), distance = c(3241.88409599252, 106.39078390924, 5456.33700758756,
71.2359425785903, 71.2359425785903, 8640.94151730882, 2725.20357275113,
6319.36823149692, 2420.67310365364, 62.5615027825454, 5464.76021027322,
3241.88409599252, 192.467213137768, 8665.6776299234, 62.5615027825454,
9664.83271725758, 4985.72628636199, 141.396853243087, 5521.3444368517,
5405.101536154, 5436.65634829112, 34.9800594737613, 15536.1468890647,
15607.2384436169, 72.1080346201786, 5405.101536154, 5405.101536154,
2023.20076623555, 3435.28409601096, 4354.77279195115)), row.names = c("2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14",
"15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25",
"26", "27", "28", "29", "30", "31"), class = "data.frame")
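I'm not quite sure I understand what is meant by "ranking the destinations based on intervening opportunities". It surely means that the same destination can have different ranks depending on the user's origin. I'm also not sure whether "intervening" is supposed to imply a direction between origin and destination.
Anyway, this is what I got:
Proposal 1 (considering direction)
Prepare the trips df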
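library(sf)
# Build one buffered origin-destination line per trip (points are given as lon, lat).
buffers <- vector("list", nrow(trips))
for (i in seq_len(nrow(trips))) {
  line <- st_linestring(matrix(as.numeric(trips[i, c("from_lon", "from_lat", "to_lon", "to_lat")]), ncol = 2, byrow = TRUE))
  buffers[[i]] <- st_buffer(line, dist = 0.01)  # dist is in degrees here
}
buffer_sfc <- st_sfc(buffers, crs = 4326)
sf_trips <- st_sf(trips, geometry = buffer_sfc)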
Prepare the dest df
sf_dest <- st_as_sf(x = dest, coords = c("venue_lon", "venue_lat"), crs = 4326)
Create the ranks
res <- st_contains(sf_trips, sf_dest)
trips$rank <- sapply(res, length)
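This is what it does: it connects origin and destination with a straight line and creates a polygon around it. All other destination points located inside that polygon are then the "intervening" ones. You can adjust the size of the polygon via the dist = parameter of st_buffer, depending on how much deviation from the direct connection still qualifies as "intervening".
I am very confident that this runs faster than your code. If, however, by "intervening" you mean any place that is closer to the origin, regardless of direction, you could do this: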
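Proposal 2
library(RANN)
# nn2() needs numeric columns only, in the same order for data and query.
dest_xy <- dest[, c("venue_lat", "venue_lon")]
intv_ops <- vector("list", nrow(trips))
for (i in seq_len(nrow(trips))) {
  # distance is in metres; dividing by ~111 km per degree gives a rough radius in degrees.
  # k defaults to min(10, nrow(data)), so raise it or the count is capped at 10.
  intv_ops[[i]] <- nn2(dest_xy, trips[i, c("from_lat", "from_lon")], k = nrow(dest_xy), searchtype = "radius", radius = (trips$distance[i] / 1.11) * 0.00001)$nn.idx
}
trips$rank <- sapply(intv_ops, function(x) sum(x != 0))  # slots with no neighbour are returned as 0
nn2 is a wrapper around a knn algorithm written in C++, so it is pretty fast.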
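As a cross-check for Proposal 2, here is a minimal base-R sketch that counts, for each trip, the places strictly closer to the origin than the destination, using great-circle distances in metres instead of a radius in degrees. A radius in degrees treats latitude and longitude alike, although around Prague a degree of longitude is roughly 71 km while a degree of latitude is roughly 111 km. The helper names haversine_m and count_closer and the toy data are assumptions made for illustration; the column names follow the question. Excluding places at distance 0 (the origin venue itself) is also an assumption about the intended definition.

```r
# Great-circle (haversine) distance in metres; inputs in decimal degrees.
# r is an assumed mean Earth radius in metres.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# For each trip, count the places strictly closer to the origin
# than the trip's destination (direction is ignored).
count_closer <- function(trips, places) {
  vapply(seq_len(nrow(trips)), function(i) {
    # Distances from this trip's origin to every place (vectorised).
    d_places <- haversine_m(trips$from_lat[i], trips$from_lon[i],
                            places$venue_lat, places$venue_lon)
    # Distance from the origin to the destination.
    d_trip <- haversine_m(trips$from_lat[i], trips$from_lon[i],
                          trips$to_lat[i], trips$to_lon[i])
    # d_places > 0 drops the origin venue itself.
    sum(d_places > 0 & d_places < d_trip)
  }, integer(1))
}

# Toy data (hypothetical): four places on one line of latitude.
places <- data.frame(venue_lat = c(50, 50, 50, 50),
                     venue_lon = c(14.00, 14.01, 14.02, 14.05))
trips <- data.frame(from_lat = c(50, 50, 50),
                    from_lon = c(14.00, 14.00, 14.05),
                    to_lat   = c(50, 50, 50),
                    to_lon   = c(14.02, 14.05, 14.00))
trips$rank <- count_closer(trips, places)
print(trips$rank)  # 1 2 2
```

The inner distance computation is vectorised over all places, so only the loop over trips remains. For about 23000 trips and 5000 places that is on the order of 10^8 distance evaluations, which should be far quicker than growing a dataframe row by row, though I have not timed it on the full data.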