Applying a function with multiple arguments over multiple paired variables in R

Question

I have a function like this that I use for cleaning data, and it works fine.

library(stringr)

# If x contains a number, extract it; otherwise keep y (coerced to numeric).
my_fun <- function(x, y) {
    ifelse(str_detect(x, "-*\\d+\\.*\\d*"),
        as.numeric(str_extract(x, "-*\\d+\\.*\\d*")),
        as.numeric(y))
}

It takes numbers that were entered in the wrong column and reassigns them to the correct column. It is used to clean the y variable:

df$y <- my_fun(x, y)
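
To make the behaviour concrete, here is a minimal sketch on two small vectors. It uses a base R stand-in, my_fun_base (a hypothetical name, not part of the original post), which assumes the same pattern as my_fun but avoids the stringr dependency; the input values are made up for illustration.

```r
# Same pattern as in my_fun, with backslashes doubled for an R string.
pattern <- "-*\\d+\\.*\\d*"

# Hypothetical base R stand-in for my_fun: take the first number found in x,
# otherwise fall back to y coerced to numeric.
my_fun_base <- function(x, y) {
  x <- as.character(x)
  hit <- grepl(pattern, x)                 # which x values contain a number
  out <- suppressWarnings(as.numeric(y))   # default: keep y
  out[hit] <- as.numeric(regmatches(x, regexpr(pattern, x)))
  out
}

x <- c("something 10.5", "3", "no number", "this -4.5")
y <- c("1", "2", "7", "4")

cleaned <- my_fun_base(x, y)
cleaned
```

Only the third element keeps its y value, because "no number" contains no digits; the other three are taken from x.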

I have many columns/variables (more than 10) that are paired in the same format, for example

x_vars <- c("x_1", "x_2", "x_3", "x_4", "x_5", "x_6")
y_vars <- c("y_1", "y_2", "y_3", "y_4", "y_5", "y_6")

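My question is: is there a way to apply this function to all of the variables in my dataset that need to be cleaned in the same way? I can do this easily with lapply in other cases where my data-cleaning function takes only one argument, but I am struggling in this case.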

I have tried mapply but could not get it to work, probably because I am still fairly new to R. Any advice would be much appreciated.

Answer 1

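We can use mapply/Map. We extract the columns by name by passing 'x_vars' and 'y_vars' as arguments to Map, apply my_fun to each pair of extracted vectors, and assign the result back to the 'y_vars' columns of the original dataset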

df[y_vars] <- Map(function(x,y) my_fun(df[,x], df[,y]), x_vars, y_vars)

Alternatively, this can also be written as

df[y_vars] <- Map(my_fun, df[x_vars], df[y_vars])
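
As a rough end-to-end illustration of the Map pattern, the sketch below runs it on a two-pair toy data frame. The data and the base R stand-in my_fun_base are assumptions made for this example (the stand-in mirrors my_fun without needing stringr); only the Map call itself comes from the answer.

```r
pattern <- "-*\\d+\\.*\\d*"

# Hypothetical base R stand-in for my_fun (assumed to behave the same way).
my_fun_base <- function(x, y) {
  x <- as.character(x)
  hit <- grepl(pattern, x)
  out <- suppressWarnings(as.numeric(y))
  out[hit] <- as.numeric(regmatches(x, regexpr(pattern, x)))
  out
}

x_vars <- c("x_1", "x_2")
y_vars <- c("y_1", "y_2")

# Made-up data: some x cells carry numbers that belong in the paired y column.
df <- data.frame(x_1 = c("a 1.5", "none"),
                 x_2 = c("2", "b -3"),
                 y_1 = c("9", "8"),
                 y_2 = c("7", "6"),
                 stringsAsFactors = FALSE)

# Map walks both column lists in parallel: (x_1, y_1), then (x_2, y_2).
df[y_vars] <- Map(my_fun_base, df[x_vars], df[y_vars])
df
```

Map pairs the columns purely by position, so the order of x_vars and y_vars has to match.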

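Note: here we assume that all elements of 'x_vars' and 'y_vars' are columns in the original dataset. We would also expect Map to be faster and more efficient than reshaping the data to long format and then transforming it.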
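
The assumption in the note can be checked before calling Map. The following is only a sketch: building y_vars from x_vars with sub() and the toy data frame are illustrative choices, not something the answer prescribes.

```r
x_vars <- paste0("x_", 1:6)
# Derive each partner name from its x name so the pairs line up by position.
y_vars <- sub("^x_", "y_", x_vars)

# Toy data frame (hypothetical) that holds every expected column.
df <- as.data.frame(matrix("1", nrow = 2, ncol = 12,
                           dimnames = list(NULL, c(x_vars, y_vars))),
                    stringsAsFactors = FALSE)

# Columns named in either vector but absent from df.
missing_cols <- setdiff(c(x_vars, y_vars), names(df))

# Stop early if the pairing is uneven or a column is missing.
stopifnot(length(x_vars) == length(y_vars), length(missing_cols) == 0)
```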

To offer a different approach, we can use melt from data.table

library(data.table)
dM <- melt(setDT(df), measure = list(x_vars, y_vars))[,
               value3 := my_fun(value1, value2), variable]

Then we need to dcast again to get it back to 'wide' format. So this takes more steps and is not as straightforward

setnames(dcast(dM, rowid(variable)~variable, 
  value.var = c("value1", "value3"))[,variable := NULL][], c(x_vars, y_vars))[]

Data

set.seed(24)
df <- as.data.frame(matrix(sample(c(1:5, "something 10.5",
   "this -4.5", "what -5.2 value?"),
          12*10, replace=TRUE), ncol=12, dimnames = 
     list(NULL, c(x_vars, y_vars))), stringsAsFactors=FALSE)

Answer 2

Because I always think it is good to know how to do these things in base R, here are examples of how to use mapply() and lapply().

## first generate some data
my_fun <- function (x, y){
    ifelse(stringr::str_detect(x, "-*\\d+\\.*\\d*"),
        as.numeric(stringr::str_extract(x, "-*\\d+\\.*\\d*")),
        as.numeric(y))
}
df <- data.frame(replicate(12, rnorm(3)))
df[, sample(1:6, 3)] <- letters[1:3]
## not function of interest, but good mapply() example
names(df) <- c(
               mapply(paste0, rep("x_", 6), 1:6),
               mapply(paste0, rep("y_", 6), 1:6))

## print data with problem variables (cols with letters)
#df
#         x_1 x_2 x_3 x_4        x_5        x_6       y_1
#1 -0.2184993   a   a   a -0.1587070 0.37795630 0.6162796
#2  0.8511775   b   b   b  0.5743287 0.15291219 1.0594502
#3  0.8183208   c   c   c  1.8923812 0.07156925 0.8613535
#         y_2        y_3        y_4       y_5        y_6
#1  0.3240393 -1.1084067  0.5233168 0.3712705 -0.3911407
#2  0.3044824 -0.2286032 -1.0019870 1.2156441  0.4010163
#3 -1.0920677  1.3408504  1.3339865 0.3270800 -0.8416253



## if you wrote a for loop, it'd look like this maybe
out <- vector("list", 6)
for (i in seq_len(6)) {
    out[[i]] <- my_fun(df[, i], df[, i + 6])
}

## same construction can be used with lapply
dfy <- lapply(seq_len(6), function(i)
    my_fun(df[, 1:6][[i]],
           df[, 7:12][[i]]))
## df has 3 rows, so the result is a 3 x 6 matrix
matrix(unlist(dfy), 3, 6)
#           [,1]       [,2]       [,3]       [,4]       [,5]
#[1,] -0.2184993  0.3240393 -1.1084067  0.5233168 -0.1587070
#[2,]  0.8511775  0.3044824 -0.2286032 -1.0019870  0.5743287
#[3,]  0.8183208 -1.0920677  1.3408504  1.3339865  1.8923812
#           [,6]
#[1,] 0.37795630
#[2,] 0.15291219
#[3,] 0.07156925

## and mapply makes this even easier
mapply(my_fun, df[, 1:6], df[, 7:12])
#            x_1        x_2        x_3        x_4        x_5
#[1,] -0.2184993  0.3240393 -1.1084067  0.5233168 -0.1587070
#[2,]  0.8511775  0.3044824 -0.2286032 -1.0019870  0.5743287
#[3,]  0.8183208 -1.0920677  1.3408504  1.3339865  1.8923812
#            x_6
#[1,] 0.37795630
#[2,] 0.15291219
#[3,] 0.07156925
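
One detail worth spelling out: mapply() simplifies its result to a matrix when every piece has the same length, whereas Map() is mapply() with SIMPLIFY = FALSE and returns a list. The sketch below shows both forms and assigns the matrix back to the y columns. The helper fill_from and the data are hypothetical stand-ins chosen to keep the example dependency-free.

```r
# Hypothetical two-argument cleaner: use x where present, otherwise y.
fill_from <- function(x, y) ifelse(is.na(x), y, x)

x_vars <- c("x_1", "x_2")
y_vars <- c("y_1", "y_2")

# Made-up data for illustration.
df <- data.frame(x_1 = c(1, NA, 3), x_2 = c(NA, 5, NA),
                 y_1 = c(9, 2, 9),  y_2 = c(4, 9, 6))

# mapply() returns a 3 x 2 matrix named after the first argument's columns.
res_matrix <- mapply(fill_from, df[x_vars], df[y_vars])

# Map() returns the same values as a list, one element per column pair.
res_list <- Map(fill_from, df[x_vars], df[y_vars])

# Either form can be assigned back to the paired columns.
df[y_vars] <- res_matrix
df
```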
