R: How to locate a column in a large dataframe by using information from another dataframe with fewer rows

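I have a data frame (A) with a column containing some information. I have a larger data frame (B) that contains a column with similar information, and I need to detect which column holds the same data as the column in dataframeA. Because dataframeB is much larger, looking through it manually to identify the column would be time-consuming. Is there a way to use the information in the 'some_info' column of dataframeA to find the corresponding column in dataframeB that contains that information?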


dataframeA <- data.frame(some_info = c("a","b","c","d","e") )

dataframeB <- data.frame(id = 1:8, column_to_be_identified = c("a","f","b","c","g", "d","h", "e"), "column_almost_similar_but_not_quite" =c("a","f","b","c","g", "3","h", "e")  )

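Basically: is it possible to create a function or something similar that looks through dataframeB and detects the column that contains exactly the information from the column in dataframeA?

Thanks in advance!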

If I understood correctly and you just want to get the column name:

dataframeA <- data.frame(some_info = as.POSIXct(Sys.Date() - 1:5))
dataframeA
#>             some_info
#> 1 2021-09-16 02:00:00
#> 2 2021-09-15 02:00:00
#> 3 2021-09-14 02:00:00
#> 4 2021-09-13 02:00:00
#> 5 2021-09-12 02:00:00
class(dataframeA$some_info)
#> [1] "POSIXct" "POSIXt"
dataframeB <- data.frame(id = 1:10, 
                         column_to_be_identified = as.POSIXct(Sys.Date() - 1:10),
                         column_almost_similar_but_not_quite = as.POSIXct(Sys.Date() - 6:15) )
dataframeB
#>    id column_to_be_identified column_almost_similar_but_not_quite
#> 1   1     2021-09-16 02:00:00                 2021-09-11 02:00:00
#> 2   2     2021-09-15 02:00:00                 2021-09-10 02:00:00
#> 3   3     2021-09-14 02:00:00                 2021-09-09 02:00:00
#> 4   4     2021-09-13 02:00:00                 2021-09-08 02:00:00
#> 5   5     2021-09-12 02:00:00                 2021-09-07 02:00:00
#> 6   6     2021-09-11 02:00:00                 2021-09-06 02:00:00
#> 7   7     2021-09-10 02:00:00                 2021-09-05 02:00:00
#> 8   8     2021-09-09 02:00:00                 2021-09-04 02:00:00
#> 9   9     2021-09-08 02:00:00                 2021-09-03 02:00:00
#> 10 10     2021-09-07 02:00:00                 2021-09-02 02:00:00

relevant_column_name <- names(
  which(
    # iterate over all columns
    sapply(dataframeB, function(x) {
      # unique is more efficient for large vectors
      x <- unique(x)
      # are all values of the target vector in the column
      all(dataframeA$some_info %in% x)
    })))

relevant_column_name
#> [1] "column_to_be_identified"
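
Since the question asks for a function, the same check can be wrapped in one and run on the character data from the question. This is only a minimal sketch: `find_matching_columns` is a hypothetical name, not part of base R or any package, and it assumes "contains" means that every value of the target vector appears somewhere in the column.

```r
# Hypothetical helper: returns the names of all columns in `df`
# that contain every value of `target`.
find_matching_columns <- function(target, df) {
  # de-duplicate the target once, outside the loop over columns
  target <- unique(target)
  # one TRUE/FALSE per column: are all target values present?
  is_match <- vapply(df, function(col) all(target %in% col), logical(1))
  names(df)[is_match]
}

# Example data from the question
dataframeA <- data.frame(some_info = c("a", "b", "c", "d", "e"))
dataframeB <- data.frame(
  id = 1:8,
  column_to_be_identified = c("a", "f", "b", "c", "g", "d", "h", "e"),
  column_almost_similar_but_not_quite = c("a", "f", "b", "c", "g", "3", "h", "e")
)

matches <- find_matching_columns(dataframeA$some_info, dataframeB)
matches
#> [1] "column_to_be_identified"
```

The near-duplicate column is rejected because it has "3" where "d" is expected. If several columns qualify, all of their names are returned.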

With dplyr's select(), we can do this:

library(dplyr)
dataframeB %>% 
   select(where(~ is.character(.) && 
           all(dataframeA$some_info %in% .))) %>%
   names
#> [1] "column_to_be_identified"
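
If no column contains every value, it can help to see how close each column comes. The base R sketch below reports, for each column of dataframeB, the share of dataframeA's values found in it. It uses the question's example data; `match_share` is a hypothetical name chosen for illustration.

```r
# Example data from the question
dataframeA <- data.frame(some_info = c("a", "b", "c", "d", "e"))
dataframeB <- data.frame(
  id = 1:8,
  column_to_be_identified = c("a", "f", "b", "c", "g", "d", "h", "e"),
  column_almost_similar_but_not_quite = c("a", "f", "b", "c", "g", "3", "h", "e")
)

# For each column: fraction of the target values that occur in it
# (1 means every value was found, 0 means none were)
target <- unique(dataframeA$some_info)
match_share <- vapply(
  dataframeB,
  function(col) mean(target %in% col),
  numeric(1)
)

# Sort so the best candidates come first
sort(match_share, decreasing = TRUE)
```

Here the fully matching column scores 1, the near-duplicate scores 0.8 (four of the five values are present), and `id` scores 0.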