which() function in R - after sorting in descending order, issues matching with duplicate values
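I'm trying to find the next closest store from a matrix of store IDs, zip codes, and long/lat coordinates for each zip code. Issues arise when there is more than 1 store per zip code: the script doesn't know how to order 2 identical values (store x is 10 miles away, store y is 10 miles away, and instead of returning x,y or y,x it returns c(x,y)). I need a way for my code to list them in some order (any order will do, since they are the same distance from the store, based on zip code).
I'm thinking there might be a modification to the which() function I could make, but I'm not having any luck.
Note that all stores run fine; only the roughly 100 stores that share a zip code with another store get tripped up, and I'd rather not go through and edit the CSV by hand.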
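library(zipcode)    # clean.zipcodes() and the zipcode lookup table
library(geosphere)  # distm()
library(data.table) # setDT() and :=
data(zipcode)       # load the zip -> city/state/latitude/longitude table
source<-read.csv("C:/Users/mcan/Desktop/Projects/Closest Store/Site and Zip.csv",header=TRUE, sep=",") #open (forward slashes: "\U" is an invalid escape in an R string)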
zip<-source[,2] #break apart the source zip codes
ID<-source[,1] #break apart the IDs
zip<-clean.zipcodes(zip) #clean up the zipcodes
CleanedData<-data.frame(ID,zip) #combine the IDs and cleaned Zip codes
CleanedData<-merge(x=CleanedData,y=zipcode,by="zip",all.x=TRUE) #dataset of store IDs, zipcodes, and their long/lat positions
setDT(CleanedData) #set data frame to data table
storeDistances <- distm(CleanedData[,.(longitude,latitude)],CleanedData[,.(longitude,latitude)]) #matrix between long/lat points of all stores in list
colnames(storeDistances) <- rownames(storeDistances) <- CleanedData[,ID]
whatsClosest <- function(number=1){
  apply(storeDistances,1,function(x) (colnames(storeDistances)[which(x==sort(x)[number+1])])) #sort() is ascending; position number+1 skips the store itself (distance 0), then the distance is matched back to a store ID
}
CleanedData[,firstClosestSite:=whatsClosest(1)] #looks for 1st closest store
CleanedData[,secondClosestSite:=whatsClosest(2)] #looks for 2nd closest store
CleanedData[,thirdClosestSite:=whatsClosest(3)] #looks for 3rd closest store
Classes ‘data.table’ and 'data.frame': 1206 obs. of 9 variables:
$ zip : Factor w/ 1182 levels "01234","02345",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 11111 12222 13333 10528 ...
$ city : chr "Boston" "Somerville" "Cambridge" "Weston" ...
$ state : chr "MA" "MA" "MA" "MA" ...
$ latitude : num 40.0 41.0 42.0 43.0 ...
$ longitude : num -70.0 -70.1 -70.2 -70.3 -70.4 ...
$ firstClosestSite :List of 1206
..$ : chr "12345"
$ secondClosestSite :List of 1206
..$ : chr "12344"
$ thirdClosestSite :List of 1206
..$ : chr "12343"
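The problem comes from firstClosestSite and secondClosestSite. They are sorted by distance, but when two stores sit in the same zip code and therefore have the same distance, the which() function (I think) doesn't know how to account for that, so you get this awkward concatenation in the CSV: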
StoreID Zip City State Longitude Latitude FirstClosestSite
11222 11000 Boston MA 40.0 -70.0 c("11111""12222")
SecondClosestSite ThirdClosestSite
c("11111" "12222") 13333
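Example of how the distance matrix is formed (store IDs are in the first row and first column, and the matrix values are the distances between stores):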
11111 22222 33333 44444 55555 66666
11111 0 6000 32000 36000 28000 28000
22222 6000 0 37500 40500 32000 32000
33333 32000 37500 0 11000 6900 6900
44444 36000 40500 11000 0 8900 8900
55555 28000 32000 6900 8900 0 0
66666 28000 32000 6900 8900 0 0
The issue is the duplicated values in each row: which() can't choose between 55555 and 66666, which are tied at the same distance from 11111.
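To make the tie concrete, here is a minimal sketch that rebuilds the example matrix above and runs the same which(x == sort(x)[k]) lookup on the row for store 11111. The variable names ids, x and tied are only for illustration. Because the comparison matches every column that holds the tied distance, it returns two IDs, whereas order() returns exactly one position per rank.

```r
# Example distance matrix copied from the question
ids <- c("11111", "22222", "33333", "44444", "55555", "66666")
storeDistances <- matrix(
  c(    0,  6000, 32000, 36000, 28000, 28000,
     6000,     0, 37500, 40500, 32000, 32000,
    32000, 37500,     0, 11000,  6900,  6900,
    36000, 40500, 11000,     0,  8900,  8900,
    28000, 32000,  6900,  8900,     0,     0,
    28000, 32000,  6900,  8900,     0,     0),
  nrow = 6, byrow = TRUE, dimnames = list(ids, ids)
)

x <- storeDistances["11111", ]

# Value-based lookup: every column equal to the 3rd smallest distance matches
tied <- names(x)[which(x == sort(x)[3])]
print(tied)    # two IDs, because 55555 and 66666 share the distance 28000

# Position-based lookup: order() gives one index per rank, so ties are
# split across consecutive ranks in their original column order
byRank <- names(x)[order(x)]
print(byRank[3])
print(byRank[4])
```

Since apply() gets results of different lengths from row to row, it falls back to returning a list, which is why the new columns show up as list columns in the str() output.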
Here is my attempt at a solution. It picks up after the colnames(storeDistances) <- ... line:
# One data frame per store: the distance to every store, plus that store's ID
whatsClosestList <- sapply(as.data.frame(storeDistances), function(x) list(data.frame(distance = x, store = rownames(storeDistances), stringsAsFactors = FALSE)))
# Get the names of the stores
# this step is necessary because lapply doesn't allow us
# to access the list names
storeNames = names(whatsClosestList)
# Iterate through each store's data frame using storeNames
# and delete the distance to itself
whatsClosestListRemoveSelf <- lapply(storeNames, function(name) {
  df <- whatsClosestList[[name]]
  df <- df[!df$store == name,]
  df # return the data frame without the store's own row
})
# The previous step got rid of the store names in the list,
# so we add them again here
names(whatsClosestListRemoveSelf) <- storeNames
whatsClosestOrderedList <- lapply(whatsClosestListRemoveSelf, function(df) { df[order(df$distance),] })
whatsClosestTopThree <- lapply(whatsClosestOrderedList, function(df) { df$store[1:3] })
firstClosestSite <- lapply(whatsClosestTopThree, function(x) { x[1]} )
secondClosestSite <- lapply(whatsClosestTopThree, function(x) { x[2]} )
thirdClosestSite <- lapply(whatsClosestTopThree, function(x) { x[3]} )
CleanedData[,firstClosestSite:=firstClosestSite] #looks for 1st closest store in list
CleanedData[,secondClosestSite:=secondClosestSite] #looks for 2nd closest store in list
CleanedData[,thirdClosestSite:=thirdClosestSite] #looks for 3rd closest store in list
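Basically, instead of searching only for the (first, second, third) closest site, I create a list of data frames, one per store, holding the distances to all the other stores. I then sort those data frames and pull out the three closest stores, which sometimes include ties (tied stores stay in the order they appear in the matrix). After that you just extract, for each store, a list with the firstClosestSite, secondClosestSite, etc., and that is what you assign in CleanedData.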
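The same idea can be sketched more compactly with order() applied to each row of the distance matrix. This is only an illustration under stated assumptions, not the code from the answer above: nearestStores is a hypothetical helper, and the four-store matrix (IDs A to D, with C and D at the same location) is made-up toy data. The store itself is dropped by position rather than by distance, so a co-located store at distance 0 is kept.

```r
# Hypothetical helper: the k nearest stores for every row of a distance matrix.
# Ties are left in column order, because order() keeps tied values in their
# original ordering.
nearestStores <- function(distMat, k = 3) {
  ids <- colnames(distMat)
  picks <- lapply(seq_len(nrow(distMat)), function(i) {
    d <- distMat[i, -i]              # drop the store itself by position
    ids[-i][order(d)[seq_len(k)]]    # one ID per rank, even when distances tie
  })
  out <- matrix(unlist(picks), ncol = k, byrow = TRUE)
  rownames(out) <- rownames(distMat)
  out
}

# Toy data (an assumption for illustration): stores C and D share a location
toyIds <- c("A", "B", "C", "D")
toyDist <- matrix(
  c( 0, 10, 25, 25,
    10,  0, 30, 30,
    25, 30,  0,  0,
    25, 30,  0,  0),
  nrow = 4, byrow = TRUE, dimnames = list(toyIds, toyIds)
)

nn <- nearestStores(toyDist, k = 2)
print(nn)
```

Each column of the result is a plain character vector with one store ID per row, so it could be assigned to a column such as firstClosestSite without producing a list column.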