Closest value and list index across all data frames in a list

I have a list containing data frames:

test <- list()
test[[1]] <- data.frame(C1=c(0.2,0.4,0.5), C2=c(2,3.5,3.7), C3=c(0.3,4,5))
test[[2]] <- data.frame(C1=c(0.1,0.3,0.6), C2=c(3.9,4.3,8), C3=c(3,5.2,10))
test[[3]] <- data.frame(C1=c(0.4,0.55,0.8), C2=c(8.9,10.3,14), C3=c(7,8.4,11))

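Across all rows of all data frames in this list, I want to find the row whose value in a given column (C2 in this example) is closest to each element of the vector named vector (below), along with the index of the list element in which it occurs (1, 2 or 3 in this example).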

vector <- c(3, 14.4, 7, 0)

The desired answer should look like this:

list.index  line.number.in.df  C1   C2   C3
1           2                  0.4  3.5  4
3           3                  0.8  14   11
2           3                  0.6  8    10
1           1                  0.2  2    0.3

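I can manage to solve about 10% of the problem using lapply for a single value, but I cannot do it for a whole vector of values. My attempt also returns the closest row from every data frame rather than the single closest row across all data frames, and I cannot get the corresponding list index, i.e.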

value <- 3
lapply(test, function(x) x[which.min(abs(value-x$C2)),])

The result I get:

[[1]]
  C1  C2 C3
2 0.4 3.5  4

[[2]]
  C1  C2 C3
1 0.1 3.9  3

[[3]]
  C1  C2 C3
1 0.4 8.9  7

Could anyone be so kind and patient as to help me get further with this problem?

Thanks in advance, and happy new year.

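You could exploit the names that unlist() produces and extract the list index and row number from them with substr() and substring().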

(w <- sapply(v, \(v) 
            names(which.min(abs(unlist(setNames(test, seq_along(test))) - v)))))
# [1] "2.C31" "3.C23" "3.C31" "2.C11"

t(mapply(\(x, y) c(list=x, line=y, test[[x]][y, ]), 
         as.numeric(substr(w, 1, 1)), as.numeric(substring(w, 5)))) |> 
  as.data.frame()
#   list line  C1  C2 C3
# 1    2    1 0.1 3.9  3
# 2    3    3 0.8  14 11
# 3    3    1 0.4 8.9  7
# 4    2    1 0.1 3.9  3

Note: this requires R >= 4.1 (for the \(x) lambda shorthand and the native |> pipe).


Data:

test <- list(structure(list(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7
), C3 = c(0.3, 4, 5)), class = "data.frame", row.names = c(NA, 
-3L)), structure(list(C1 = c(0.1, 0.3, 0.6), C2 = c(3.9, 4.3, 
8), C3 = c(3, 5.2, 10)), class = "data.frame", row.names = c(NA, 
-3L)), structure(list(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3, 
14), C3 = c(7, 8.4, 11)), class = "data.frame", row.names = c(NA, 
-3L)))

v <- c(3, 14.4, 7, 0)
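
The output above differs from the table requested in the question because unlist() searches every column, not only C2. The following is a minimal base R sketch, not the answerer's method, that restricts the search to C2 and avoids parsing names. The object names stacked, closest_rows and result are illustrative assumptions.

```r
# Sample data copied from the question
test <- list(
  data.frame(C1 = c(0.2, 0.4, 0.5),  C2 = c(2, 3.5, 3.7),   C3 = c(0.3, 4, 5)),
  data.frame(C1 = c(0.1, 0.3, 0.6),  C2 = c(3.9, 4.3, 8),   C3 = c(3, 5.2, 10)),
  data.frame(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3, 14), C3 = c(7, 8.4, 11))
)
v <- c(3, 14.4, 7, 0)

# Stack the data frames, tagging every row with its list index and row number
stacked <- do.call(rbind, lapply(seq_along(test), function(i) {
  data.frame(list.index = i,
             line.number.in.df = seq_len(nrow(test[[i]])),
             test[[i]])
}))

# For each target, take the stacked row whose C2 is nearest (first one on ties)
closest_rows <- vapply(v, function(target) which.min(abs(stacked$C2 - target)),
                       integer(1))
result <- stacked[closest_rows, ]
rownames(result) <- NULL
result
```

Because the list index and row number are stored as ordinary columns before the search, this sketch does not depend on list indexes being single digits or on column names having a fixed width.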

Hope this is what you are looking for. It finds the values in the columns of each element of test that are closest to the values in vector.

#install.packages('birk')
library(birk) # required for which.closest()

# find which of the values across the columns C1:C3 in each element of test are closest
# to the values of vector and return the corresponding row numbers
x <- sapply(1:length(vector), \(x) sapply(test, \(i) apply(i, 2, \(j) which.closest(j, vector[x]))))
x <- apply(x, 1, \(x) as.data.frame(table(x)))
x <- lapply(x, \(i) i[which.max(i[, 2]), ])
row_numbers_df <- as.numeric(matrix(do.call(rbind, x)[['x']]))

# extract the values in each of the column C1:C3 corresponding to row_numbers_df
vals <- array(0, dim = length(row_numbers_df))
for (i in 1:length(row_numbers_df)) { vals[i] <- do.call(cbind, test)[row_numbers_df[i], i] }

# how many columns does each data.frame embedded in test have?
unique_number_of_cols <- unique(sapply(test, ncol))

# store results in a data.frame
r <- \(x) round(x, 1)
out <- data.frame(
  seq_len(length(test)),
  r(rowMeans(matrix(row_numbers_df, ncol = unique_number_of_cols, byrow = TRUE))),
  matrix(vals, ncol = unique_number_of_cols, byrow = TRUE)
)
names(out) <- c('list.index', 'line.number.in.df', sapply(test, colnames)[, 1])

Result

> out
  list.index line.number.in.df  C1  C2 C3
1          1               3.0 0.5 3.7  5
2          2               1.7 0.6 3.9  3
3          3               1.7 0.8 8.9  7
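
For readers who cannot install birk, the per-column lookup step could be sketched in base R as below. This assumes that which.closest(j, value) returns the position of the element of j nearest to value, which is my reading of how it is used here rather than something taken from the package documentation. The helper closest_index is hypothetical.

```r
# Hypothetical base R stand-in for the nearest-position lookup
closest_index <- function(x, value) which.min(abs(x - value))

# First data frame from the question
df <- data.frame(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7), C3 = c(0.3, 4, 5))

# Row number of the value nearest to 3 in each column
rows <- vapply(df, closest_index, integer(1), value = 3)
rows
```

The result is a named integer vector with one row number per column, which is the building block the code above aggregates across list elements.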

Alternatively, if you do want a separate line.number.in.df for each column, you can easily store them as separate columns in out.

x <- sapply(1:length(vector), \(x) sapply(test, \(i) apply(i, 2, \(j) which.closest(j, vector[x]))))
x <- apply(x, 1, \(x) as.data.frame(table(x)))
x <- lapply(x, \(i) i[which.max(i[, 2]), ])
row_numbers_df <- as.numeric(matrix(do.call(rbind, x)[['x']]))
names(row_numbers_df) <- do.call(c, lapply(test, names))

row_numbers_df
vals <- array(0, dim = length(row_numbers_df))
for (i in 1:length(row_numbers_df)) { vals[i] <- do.call(cbind, test)[row_numbers_df[i], i] }

unique_number_of_cols <- unique(sapply(test, ncol))

out <- data.frame(
  seq_len(length(test)),
  split(row_numbers_df, names(row_numbers_df)),
  matrix(vals, ncol = unique_number_of_cols, byrow = TRUE)
)
column_names <- sapply(test, colnames)[, 1]
names(out) <- c('list.index',
                paste0('line.number.in.df.', column_names),
                column_names)

Result

> out
  list.index line.number.in.df.C1 line.number.in.df.C2 line.number.in.df.C3  C1  C2 C3
1          1                    3                    3                    3 0.5 3.7  5
2          2                    3                    1                    1 0.6 3.9  3
3          3                    3                    1                    1 0.8 8.9  7

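Here is a dplyr approach. We can generate list.index and line.number.in.df for each data frame, then combine the data frames with bind_rows. Next, slice the rows whose C2 holds the value closest to each number in vector.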

library(dplyr)

test <- list(structure(list(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7
), C3 = c(0.3, 4, 5)), class = "data.frame", row.names = c(NA, 
-3L)), structure(list(C1 = c(0.1, 0.3, 0.6), C2 = c(3.9, 4.3, 
8), C3 = c(3, 5.2, 10)), class = "data.frame", row.names = c(NA, 
-3L)), structure(list(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3, 
14), C3 = c(7, 8.4, 11)), class = "data.frame", row.names = c(NA, 
-3L)))

vector <- c(3, 14.4, 7, 0)

test %>% 
  lapply(tibble::rowid_to_column, "line.number.in.df") %>% 
  bind_rows(.id = "list.index") %>% 
  slice(vapply(vector, \(x) which.min(abs(x - C2)), integer(1L)))

The output is

  list.index line.number.in.df  C1   C2   C3
1          1                 2 0.4  3.5  4.0
2          3                 3 0.8 14.0 11.0
3          2                 3 0.6  8.0 10.0
4          1                 1 0.2  2.0  0.3
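
One detail worth knowing about the which.min(abs(x - C2)) idiom used above: when two values are equally close, which.min returns only the first position. The small sketch below illustrates this with made-up numbers (candidates and target are not from the question) and shows one way to recover every tied position.

```r
# Illustrative values: 2 and 4 are both exactly 1 away from the target 3
candidates <- c(2, 4, 10)
target <- 3

dists <- abs(target - candidates)

# which.min reports only the first position that attains the minimum
first_hit <- which.min(dists)

# which() with an equality test reports every tied position
# (exact comparison is fine here; real-valued data may need a tolerance)
all_hits <- which(dists == min(dists))
```

If ties matter for your data, the second form can replace which.min, at the cost of possibly returning more than one row per target value.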