Subset base function ignoring repeated entries when outputting

Question

I recently asked how to recode values in a dataset using a dictionary file.

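I have now run into a simpler problem, but that fix does not work for it. Suppose I have the following dataset, where each row is a geographical unit and column V1 lists the "first neighbor" found by a geography package, expressed as a row number: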

V1 <- c(1, 2, 1)
id <- c(110001, 110002, 110003)
dataset <- as.data.frame(matrix(c(id, V1), ncol=2))
colnames(dataset) <- c("id", "V1")

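So in this dataset, area 110001 is its own neighbor (V1 = 1), and area 110003 is a neighbor of 110001 (V1 = 1). Instead of showing V1 (the first neighbor) as "1, 2, 1", I want it shown as "110001, 110002, 110001", that is, as the ids of the geographical areas.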

So I created a "dictionary" that holds the row number and the id of each geographical area:

dictionary <- as.data.frame(matrix(c(dataset$id, 1:nrow(dataset)),ncol=2))
colnames(dictionary) <- c("id","row")

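Then I tried to map these using mutate. Note that I have many neighbor variables (V1-V30) and use only one in this example, so I use the syntax that applies the transformation to all of them: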

new_dataset <- dataset %>% mutate(across(starts_with("V"), ~subset(dictionary, row == cur_column(), select= id)))

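What this should do is run across the columns, compare each value with the dictionary's row value, and return the matching id. The problem seems to be the repeated entries in dataset$V1 (here, rows 1 and 3 both equal 1). If I do it row by row, it works: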

first_row <- dataset[1,] %>% mutate(V1 = subset(dictionary, row == V1, select= id))    
second_row <- dataset[2,] %>% mutate(V1 = subset(dictionary, row == V1, select= id))  
third_row <- dataset[3,] %>% mutate(V1 = subset(dictionary, row == V1, select= id))

My impression is that subset ignores repeated entries. For example, if I run this:

 subset(dictionary, row == dataset$V1, select= id)

it should return "110001, 110002, 110001", but it only returns "110001, 110002".

Any ideas on how to make subset return everything, or on another approach?

Answer 1

We can use rowwise

library(dplyr)
dataset %>%
     rowwise %>% 
     mutate(V1 = subset(dictionary, row == V1, select= id)$id) %>%
     ungroup

-output

# A tibble: 3 x 2
      id     V1
   <dbl>  <dbl>
1 110001 110001
2 110002 110002
3 110003 110001
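
The row-by-row version works because each comparison then involves a single V1 value. As a minimal base R sketch (not part of the original answer, and rebuilding the example data so it runs on its own), the following shows what the vectorised `row == dataset$V1` comparison does and how `match()` can act as a lookup instead. The names `keep`, `pos`, and `looked_up` are illustrative only.

```r
# Rebuild the example data from the question.
dataset <- data.frame(id = c(110001, 110002, 110003), V1 = c(1, 2, 1))
dictionary <- data.frame(id = dataset$id, row = seq_len(nrow(dataset)))

# `==` compares position by position: row[1] with V1[1], row[2] with V1[2], ...
# so it is a row filter on the dictionary, not a lookup per V1 value.
keep <- dictionary$row == dataset$V1
print(keep)

# match() returns, for each V1 value, the position of that value in
# dictionary$row; indexing dictionary$id with it keeps repeated entries.
pos <- match(dataset$V1, dictionary$row)
looked_up <- dictionary$id[pos]
print(looked_up)
```

Here `keep` is TRUE, TRUE, FALSE, which is why `subset()` returns only two rows, whereas `looked_up` has one id per element of V1.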

Or with data.table

library(data.table)
 setDT(dataset)[dictionary, V1 := i.id, on = .(V1 = row)]
> dataset
       id     V1
1: 110001 110001
2: 110002 110002
3: 110003 110001

If there are multiple columns, e.g. 'V1', 'V2', etc.

dataset$V2 <- V1[c(1, 3, 2)]
nm1 <- paste0("V", 1:2)
setDT(dataset)
for(nm in nm1) 
   dataset[dictionary, (nm) := i.id, on = setNames("row", nm)][]

-output

> dataset
       id     V1     V2
1: 110001 110001 110001
2: 110002 110002 110001
3: 110003 110001 110002
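
For comparison, the same multi-column recoding can be sketched in base R without data.table. This is an illustrative alternative rather than the answer's method; it assumes every neighbor column starts with "V" and that each value appears in `dictionary$row`. The helper name `v_cols` is hypothetical.

```r
# Example data with two neighbor columns, as in the data.table example.
dataset <- data.frame(
  id = c(110001, 110002, 110003),
  V1 = c(1, 2, 1),
  V2 = c(1, 1, 2)
)
dictionary <- data.frame(id = dataset$id, row = seq_len(nrow(dataset)))

# Select all neighbor columns by name (assumption: they all start with "V").
v_cols <- grep("^V", names(dataset), value = TRUE)

# Replace each row number with the id found at that row of the dictionary.
# Values that are absent from dictionary$row would become NA.
dataset[v_cols] <- lapply(dataset[v_cols], function(col) {
  dictionary$id[match(col, dictionary$row)]
})

print(dataset)
```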

Answer 2

You could use a left self-join on a modified dataset:

library(dplyr)

dataset %>% 
  left_join(
    dataset %>% 
      group_by(V1) %>% 
      slice(1),
    by = "V1") %>% 
  select(-V1)

This returns

    id.x   id.y
1 110001 110001
2 110002 110002
3 110003 110001
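
The self-join takes the first id within each V1 group, which in this example happens to be the area that V1 points to. A hedged sketch of joining against the dictionary itself, using base `merge()`, might look like this. The column name `neighbor_id` and the object names `lookup` and `joined` are assumptions made for illustration.

```r
# Rebuild the example data from the question.
dataset <- data.frame(id = c(110001, 110002, 110003), V1 = c(1, 2, 1))
dictionary <- data.frame(id = dataset$id, row = seq_len(nrow(dataset)))

# Rename the dictionary's id column so it does not clash with dataset$id.
lookup <- setNames(dictionary, c("neighbor_id", "row"))

# Left join on V1 == row. The row order of merge() output is not
# guaranteed to match the input, so restore the original order by id.
joined <- merge(dataset, lookup, by.x = "V1", by.y = "row", all.x = TRUE)
joined <- joined[order(joined$id), c("id", "neighbor_id")]

print(joined)
```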

Answer 3

Perhaps a base R option like the one below?

transform(
  dataset,
  V1 = ave(id, V1, FUN = function(x) head(x, 1))
)

which gives

      id     V1
1 110001 110001
2 110002 110002
3 110003 110001
