which() function in R - after sorting in descending order, issues matching with duplicate values

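I'm trying to find the next closest store from a matrix of store IDs, zip codes, and the long/lat coordinates of each zip code. The problem arises when a zip code contains more than one store: the script doesn't know how to order two identical values (store x is 10 miles away and store y is also 10 miles away, so it can't rank x against y and returns c(x,y) instead of x, y or y, x). I need a way for my code to list them separately (in arbitrary order, since they are the same distance from the store, based on zip code).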

I think a modification to the which() call might do it, but I haven't had any luck.

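Note that the script runs for all stores; only the roughly 100 stores that share a zip code with another store get tripped up, and I'd rather not go through and edit the CSV by hand.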

library(data.table)
library(zipcode)
library(geosphere)
source<-read.csv("C:/Users/mcan/Desktop/Projects/Closest Store/Site and Zip.csv",header=TRUE, sep=",") #open (forward slashes: "\U" in an R string is a parse error)
zip<-source[,2] #break apart the source zip codes 
ID<-source[,1] #break apart the IDs
zip<-clean.zipcodes(zip) #clean up the zipcodes 
CleanedData<-data.frame(ID,zip) #combine the IDs and cleaned Zip codes
CleanedData<-merge(x=CleanedData,y=zipcode,by="zip",all.x=TRUE) #dataset of store IDs, zipcodes, and their long/lat positions
setDT(CleanedData) #set data frame to data table 
storeDistances <- distm(CleanedData[,.(longitude,latitude)],CleanedData[,.(longitude,latitude)]) #matrix between long/lat points of all stores in list 
colnames(storeDistances) <- rownames(storeDistances) <- CleanedData[,ID] 
whatsClosest <- function(number=1){
    apply(storeDistances,1,function(x) (colnames(storeDistances)[which(x==sort(x)[number+1])])) #sorts ascending, picks the (number+1)-th smallest distance (position 1 is the store itself at distance 0), matches with storeID
}
CleanedData[,firstClosestSite:=whatsClosest(1)] #looks for 1st closest store
CleanedData[,secondClosestSite:=whatsClosest(2)] #looks for 2nd closest store
CleanedData[,thirdClosestSite:=whatsClosest(3)] #looks for 3rd closest store 

Dataset format:

 Classes ‘data.table’ and 'data.frame': 1206 obs. of  9 variables:
     $ zip              : Factor w/ 1182 levels "01234","02345",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ ID               : int  11111 12222 13333 10528 ...
     $ city             : chr  "Boston" "Somerville" "Cambridge" "Weston" ...
     $ state            : chr  "MA" "MA" "MA" "MA" ...
     $ latitude         : num  40.0 41.0 42.0 43.0 ...
     $ longitude        : num  -70.0 -70.1 -70.2 -70.3 -70.4 ...
     $ firstClosestSite :List of 1206
       ..$ : chr "12345"
     $ secondClosestSite:List of 1206
       ..$ : chr "12344"
     $ thirdClosestSite :List of 1206
       ..$ : chr "12343"

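The problem comes from firstClosestSite and secondClosestSite. They are ordered by distance, but when two distances are identical because two stores sit in the same zip code, the which() function (I think) doesn't know how to account for that, so you get this awkward concatenation in the CSV: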

StoreID  Zip    City    State  Latitude  Longitude  FirstClosestSite
11222    11000  Boston  MA     40.0      -70.0      c("11111", "12222")

SecondClosestSite      ThirdClosestSite
c("11111", "12222")    13333

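An example of how the distance matrix is formed (store IDs in the first row and first column; the matrix values are the distances between the stores):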

    11111   22222     33333   44444   55555   66666
11111   0      6000    32000   36000  28000   28000
22222   6000    0      37500   40500  32000   32000
33333   32000   37500   0      11000   6900   6900
44444   36000   40500   11000   0     8900    8900
55555   28000   32000   6900    8900    0     0
66666   28000   32000   6900    8900    0     0

The problem is the duplicated values within each row: which() doesn't know whether 55555 or 66666 is closer to 11111, since both are 28000 away.
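To make the tie concrete, here is a minimal sketch using only the first row of the example matrix above. The vector `x` is that row typed in by hand (an assumption for illustration, not the real data). It shows that matching the sorted value back with which() yields two positions when two stores share a distance, while order() yields exactly one position per rank.

```r
# First row of the example matrix: distances from store 11111
x <- c("11111" = 0, "22222" = 6000, "33333" = 32000,
       "44444" = 36000, "55555" = 28000, "66666" = 28000)

# Approach from the question: match the (k+1)-th smallest value back to x
tied <- names(x)[which(x == sort(x)[2 + 1])]
print(tied)            # two store IDs: 55555 and 66666 are both 28000 away

# order() returns one index per rank; ties keep their original order
ranked <- names(x)[order(x)]
print(ranked[2 + 1])   # a single store ID
```

So the length-2 entries in the list column come from the `x == value` comparison, not from which() itself.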

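Here is my attempt at a solution. Everything up to and including the colnames(storeDistances) <- ... line stays the same. After that, replace the rest of the code with the following: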

# Build one data frame per store (one per column of the distance matrix),
# pairing every distance with the ID of the store it refers to
whatsClosestList <- lapply(as.data.frame(storeDistances), function(x) {
  data.frame(distance = x, store = rownames(storeDistances), stringsAsFactors = FALSE)
})

# Get the names of the stores
# this step is necessary because lapply doesn't allow us
# to access the list names
storeNames <- names(whatsClosestList)

# Iterate through each store's data frame using storeNames
# and delete the distance to itself
whatsClosestListRemoveSelf <- lapply(storeNames, function(name) {
  df <- whatsClosestList[[name]]
  df[df$store != name, ]
})

# The previous step got rid of the store names in the list,
# so we add them again here
names(whatsClosestListRemoveSelf) <- storeNames

# Sort each data frame by distance (ascending)
whatsClosestOrderedList <- lapply(whatsClosestListRemoveSelf, function(df) { df[order(df$distance), ] })

# Keep the IDs of the three closest stores
whatsClosestTopThree <- lapply(whatsClosestOrderedList, function(df) { df$store[1:3] })

# vapply returns plain character vectors (one ID per store) instead of lists,
# so the new columns are ordinary character columns
firstClosest <- vapply(whatsClosestTopThree, function(x) x[1], character(1))
secondClosest <- vapply(whatsClosestTopThree, function(x) x[2], character(1))
thirdClosest <- vapply(whatsClosestTopThree, function(x) x[3], character(1))

CleanedData[, firstClosestSite := firstClosest] #1st closest store
CleanedData[, secondClosestSite := secondClosest] #2nd closest store
CleanedData[, thirdClosestSite := thirdClosest] #3rd closest store

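Basically, instead of searching only for the (first, second, third) closest site, I build a list of data frames, one per store, each holding the distances to all the other stores. I then sort those data frames and extract the three closest stores, which sometimes include ties (tied stores keep the order in which they appear in the distance matrix, because order() leaves ties in their original order). After that you only need to pull out, for each store, the firstClosestSite, secondClosestSite and so on, which is what you were looking for in CleanedData. Hope it helps!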
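For comparison, the same idea can be sketched more compactly by ranking each row of the matrix directly with order(). This is only an illustration under stated assumptions: `nthClosest` is a hypothetical helper name that appears in neither post, and the small `storeDistances` matrix below is the six-store example from the question typed in by hand, standing in for the real distm() output.

```r
# Small distance matrix taken from the example in the question
ids <- c("11111", "22222", "33333", "44444", "55555", "66666")
storeDistances <- matrix(
  c(    0,  6000, 32000, 36000, 28000, 28000,
     6000,     0, 37500, 40500, 32000, 32000,
    32000, 37500,     0, 11000,  6900,  6900,
    36000, 40500, 11000,     0,  8900,  8900,
    28000, 32000,  6900,  8900,     0,     0,
    28000, 32000,  6900,  8900,     0,     0),
  nrow = 6, byrow = TRUE, dimnames = list(ids, ids))

# Hypothetical helper (not from the original posts): for every store, return
# the ID of its n-th closest other store, with ties broken by column order
nthClosest <- function(distances, n = 1) {
  vapply(seq_len(nrow(distances)), function(i) {
    d <- distances[i, -i]      # drop the store's distance to itself by position
    names(d)[order(d)[n]]      # exactly one ID, even when distances tie
  }, character(1))
}

closest <- data.frame(
  ID = ids,
  firstClosestSite  = nthClosest(storeDistances, 1),
  secondClosestSite = nthClosest(storeDistances, 2),
  thirdClosestSite  = nthClosest(storeDistances, 3),
  stringsAsFactors = FALSE)
print(closest)
```

Dropping the store itself by position rather than assuming it is the only zero distance matters here, because two stores in the same zip code are also 0 apart. With the real data, the three vectors could be assigned to CleanedData with := in the same way as above, provided the matrix rows are in the same order as the rows of CleanedData.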