How to join location data (lat,lon)
I am working with two datasets: one with a set of locations (latitude, longitude), which is test, and another with the lat/lon information for all ZIP codes in New York City, which is test2.
test <- structure(list(trip_count = 1:10, dropoff_longitude = c(-73.959862,
-73.882202, -73.934113, -73.992203, -74.00563, -73.975189, -73.97448,
-73.974838, -73.981377, -73.955093), dropoff_latitude = c(40.773617,
40.744175, 40.715923, 40.749203, 40.726158, 40.729824, 40.763599,
40.754135, 40.759987, 40.765224)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
test2 <- structure(list(latitude = c(40.853017, 40.791586, 40.762174,
40.706903, 40.825727, 40.739022, 40.750824, 40.673138, 40.815559,
40.754591), longitude = c(-73.91214, -73.94575, -73.94917, -73.82973,
-73.81752, -73.98205, -73.99289, -73.81443, -73.90771, -73.976238
), borough = c("Bronx", "Manhattan", "Manhattan", "Queens", "Bronx",
"Manhattan", "Manhattan", "Queens", "Bronx", "Manhattan")), class = "data.frame", row.names = c(NA,
-10L))
I am now trying to join these two datasets so that, in the end, I get one borough for each trip_count. So far I have been using difference_left_join like this:
test %>% fuzzyjoin::difference_left_join(test2, by = c("dropoff_longitude" = "longitude", "dropoff_latitude" = "latitude"), max_dist = 0.01)
Although this approach works, as the datasets grow the join creates a lot of multiple matches, so I sometimes end up with a dataset ten times the size of the initial test dataset. Does anyone have a different approach that does not create piles of matches? Or is there a way to force the join to always use only one match for each row of test? I would greatly appreciate it!
EDIT: Solving this question, R dplyr left join - multiple returned values and new rows: how to ask for the first match only?, would also solve my problem. So maybe one of you has an idea on that!
You can use the geo_join functions, return the distance between matches, and then filter down to the closest match.
library(fuzzyjoin)
library(dplyr)

answer <- geo_left_join(test, test2,
                        by = c("dropoff_longitude" = "longitude", "dropoff_latitude" = "latitude"),
                        max_dist = 2, distance_col = "dist") %>%
  select(-"longitude", -"latitude")

# keep only the closest match for each trip_count
answer %>% group_by(trip_count) %>% slice_min(dist)
You may need to adjust the value of max_dist downward to reduce the number of matches. That should improve performance, but it may also generate too many NAs.
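Note that slice_min() keeps every tied row by default, so if two boroughs happen to sit at exactly the same distance from a drop-off point you can still get duplicates. A minimal sketch, assuming dplyr >= 1.0.0, that forces exactly one row per trip_count:

# with_ties = FALSE returns a single row per group even when distances tie
answer %>%
  group_by(trip_count) %>%
  slice_min(dist, with_ties = FALSE) %>%
  ungroup()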
Update
Rounding to 3 decimal places introduces at most about a 70 meter / 230 foot offset. Rounding to fewer decimal places reduces the number of unique points but increases the maximum offset.
Here is how I would handle rounding the drop-off locations and performing the join. It adds complexity, but it may help with the memory problem. I did not look at the group_by approach here, but that would also work.
# create a unique id for each rounded lon & lat
test$hash <- paste(round(test$dropoff_longitude, 3), round(test$dropoff_latitude, 3))
# the unique ids
uniques <- which(!duplicated(test$hash))
# create a reduced-size data frame of only the unique rounded points
reduced <- data.frame(hash = test$hash,
                      dropoff_longitude = round(test$dropoff_longitude, 3),
                      dropoff_latitude = round(test$dropoff_latitude, 3))[uniques, ]

# Perform the matching here,
# using the join above or something else.
# Add the matched column onto the reduced data frame;
# this example just adds a column of letters as a placeholder.
reduced$matched <- letters[1:nrow(reduced)]

# merge back to the original data set
test %>% left_join(reduced[, c("hash", "matched")], by = "hash")
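As one possible illustration of the "perform the matching here" step, the geo_left_join from above could be run against the reduced data frame and the resulting borough merged back onto test via the hash. This is only a sketch, reusing test2 and the column names defined earlier:

library(fuzzyjoin)
library(dplyr)

# match each unique rounded drop-off point to its nearest borough
matched <- geo_left_join(reduced, test2,
                         by = c("dropoff_longitude" = "longitude", "dropoff_latitude" = "latitude"),
                         max_dist = 2, distance_col = "dist") %>%
  group_by(hash) %>%
  slice_min(dist, with_ties = FALSE) %>%
  ungroup() %>%
  select(hash, borough)

# merge the borough back onto the full data set by hash
test %>% left_join(matched, by = "hash")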