How to apply functions to a specific set of columns in a data frame in R to replace NAs
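I have a dataset in which I want to replace the NAs in different columns in different ways. Below is a dummy dataset and the code to reproduce it.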
test <- data.frame(ID = c(1:5),
FirstName = c(NA,"Sid",NA,"Harsh","CJ"),
LastName = c("Snow",NA,"Lapata","Khan",NA),
BillNum = c(6:10),
Phone = c(1213,3123,3123,NA,NA),
Married = c("Yes","Yes",NA,"NO","Yes"),
ZIP = c(1111,2222,333,444,555),
Gender = c("M",NA,"F",NA,"M"),
Address = c("A","B",NA,"C","D"))
> test
ID FirstName LastName BillNum Phone Married ZIP Gender Address
1 1 <NA> Snow 6 1213 Yes 1111 M A
2 2 Sid <NA> 7 3123 Yes 2222 <NA> B
3 3 <NA> Lapata 8 3123 <NA> 333 F <NA>
4 4 Harsh Khan 9 NA NO 444 <NA> C
5 5 CJ <NA> 10 NA Yes 555 M D
In some columns, I want to indicate whether a value was provided by the customer, without retaining the provided value, as follows.
Availability_Indicator <- function(x){
x <- ifelse(is.na(x),"NotAvialable","Available")
return(x)
}
test$FirstName <- Availability_Indicator(test$FirstName)
test$LastName <- Availability_Indicator(test$LastName)
test$Phone <- Availability_Indicator(test$Phone)
test$Address <- Availability_Indicator(test$Address)
I get the following data:
> test
ID FirstName LastName BillNum Phone Married ZIP Gender Address
1 1 NotAvialable Available 6 Available Yes 1111 M Available
2 2 Available NotAvialable 7 Available Yes 2222 <NA> Available
3 3 NotAvialable Available 8 Available <NA> 333 F NotAvialable
4 4 Available Available 9 NotAvialable NO 444 <NA> Available
5 5 Available NotAvialable 10 NotAvialable Yes 555 M Available
For the Married and Gender variables, I don't want to lose the column values; I only want to replace the NAs, as follows.
NotAvailable_Indicator <- function(x){
x[is.na(x)]<-"NotAvailable"
return(x)
}
test$Married <- NotAvailable_Indicator(test$Married)
test$Gender <- NotAvailable_Indicator(test$Gender)
This gives the following dataset:
ID FirstName LastName BillNum Phone Married ZIP Gender Address
1 1 NotAvialable Available 6 Available Yes 1111 M Available
2 2 Available NotAvialable 7 Available Yes 2222 NotAvailable Available
3 3 NotAvialable Available 8 Available NotAvailable 333 F NotAvialable
4 4 Available Available 9 NotAvialable NO 444 NotAvailable Available
5 5 Available NotAvialable 10 NotAvialable Yes 555 M Available
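My problem is that I don't want to repeat the function call for each column separately, because I have around 200 columns. I can't use the apply functions directly, because I would have to subset the data, apply the function with lapply, and then cbind the result back to the original data, which changes the column order. Is there a way to supply the column names and a function name and get back the modified columns together with the other (unchanged) columns as one dataset, or to modify the columns in place without returning anything (like DataFrame.fillna in Python, which has the argument inplace=logical)?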
We can do this with the tidyverse:
library(dplyr)
#specify the columns of interest
#if there are any patterns, we can use `matches` or `grep`
nm1 <- names(test)[c(2, 3, 5, 9)]
nm2 <- names(test)[c(6, 8)]
#use `mutate_at` by specifying the arguments 'vars' and 'funs'
test %>%
mutate_at(vars(one_of(nm1)), funs(Availability_Indicator)) %>%
mutate_at(vars(one_of(nm2)), funs(NotAvailable_Indicator))
#ID FirstName LastName BillNum Phone Married ZIP Gender Address
#1 1 NotAvialable Available 6 Available Yes 1111 M Available
#2 2 Available NotAvialable 7 Available Yes 2222 NotAvailable Available
#3 3 NotAvialable Available 8 Available NotAvailable 333 F NotAvialable
#4 4 Available Available 9 NotAvialable NO 444 NotAvailable Available
#5 5 Available NotAvialable 10 NotAvialable Yes 555 M Available
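In dplyr 1.0.0 and later, `funs()` is deprecated and `mutate_at()` is superseded by `across()`. A minimal sketch of the same idea with `across()` and `all_of()`, assuming dplyr >= 1.0.0 is installed; the data and the two helper functions are copied from the question, and `result` is just an illustrative name:

```r
library(dplyr)

# Dummy data from the question, with character (not factor) columns
test <- data.frame(ID = 1:5,
                   FirstName = c(NA, "Sid", NA, "Harsh", "CJ"),
                   LastName = c("Snow", NA, "Lapata", "Khan", NA),
                   BillNum = 6:10,
                   Phone = c(1213, 3123, 3123, NA, NA),
                   Married = c("Yes", "Yes", NA, "NO", "Yes"),
                   ZIP = c(1111, 2222, 333, 444, 555),
                   Gender = c("M", NA, "F", NA, "M"),
                   Address = c("A", "B", NA, "C", "D"),
                   stringsAsFactors = FALSE)

# Helper functions from the question (spelling of the labels kept as-is)
Availability_Indicator <- function(x) {
  ifelse(is.na(x), "NotAvialable", "Available")
}
NotAvailable_Indicator <- function(x) {
  x[is.na(x)] <- "NotAvailable"
  x
}

nm1 <- c("FirstName", "LastName", "Phone", "Address")
nm2 <- c("Married", "Gender")

# across() applies one function to every selected column;
# all_of() selects columns by a character vector of names
result <- test %>%
  mutate(across(all_of(nm1), Availability_Indicator),
         across(all_of(nm2), NotAvailable_Indicator))
```

The column order of `result` matches `test`, because `mutate()` overwrites the selected columns where they already sit.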
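A base R option is to loop over the columns with lapply, apply the function, and assign the result back to those columns of the dataset: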
test[nm1] <- lapply(test[nm1], Availability_Indicator)
test[nm2] <- lapply(test[nm2], NotAvailable_Indicator)
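The question asks for a way to pass column names and a function and get back the whole data frame with the column order unchanged. A minimal sketch that wraps the lapply assignment in a helper; `apply_to_columns` is a hypothetical name, not a base R function, and the `grep` call at the end only illustrates picking columns by a name pattern:

```r
# Hypothetical helper: apply `fn` to the columns named in `cols` and
# return the full data frame; untouched columns keep their position.
apply_to_columns <- function(df, cols, fn) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Unknown columns: ", paste(missing_cols, collapse = ", "))
  }
  df[cols] <- lapply(df[cols], fn)
  df
}

# Dummy data and helper functions from the question
test <- data.frame(ID = 1:5,
                   FirstName = c(NA, "Sid", NA, "Harsh", "CJ"),
                   LastName = c("Snow", NA, "Lapata", "Khan", NA),
                   BillNum = 6:10,
                   Phone = c(1213, 3123, 3123, NA, NA),
                   Married = c("Yes", "Yes", NA, "NO", "Yes"),
                   ZIP = c(1111, 2222, 333, 444, 555),
                   Gender = c("M", NA, "F", NA, "M"),
                   Address = c("A", "B", NA, "C", "D"),
                   stringsAsFactors = FALSE)

Availability_Indicator <- function(x) {
  ifelse(is.na(x), "NotAvialable", "Available")
}
NotAvailable_Indicator <- function(x) {
  x[is.na(x)] <- "NotAvailable"
  x
}

out <- apply_to_columns(test, c("FirstName", "LastName", "Phone", "Address"),
                        Availability_Indicator)
out <- apply_to_columns(out, c("Married", "Gender"), NotAvailable_Indicator)

# With many columns, names can be selected by pattern instead of typed out,
# e.g. every column whose name ends in "Name"
name_cols <- grep("Name$", names(test), value = TRUE)
```

R copies on modification, so the helper returns a new data frame rather than changing its argument in place; assigning the result back to the same name gives the in-place effect the question asks about.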
data
It is easier to change the values of character columns than those of factor columns. Therefore, use stringsAsFactors=FALSE in the data.frame call so that the non-numeric columns are of class character.
test <- data.frame(ID = c(1:5),
FirstName = c(NA,"Sid",NA,"Harsh","CJ"),
LastName = c("Snow",NA,"Lapata","Khan",NA),
BillNum = c(6:10),
Phone = c(1213,3123,3123,NA,NA),
Married = c("Yes","Yes",NA,"NO","Yes"),
ZIP = c(1111,2222,333,444,555),
Gender = c("M",NA,"F",NA,"M"),
Address = c("A","B",NA,"C","D"), stringsAsFactors=FALSE)
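Why factor columns are harder can be seen in a small sketch: assigning a string that is not an existing level to a factor produces NA with a warning, while the same assignment on a character vector works. The variable names below are illustrative only. Note that from R 4.0.0 onward, data.frame defaults to stringsAsFactors = FALSE, so the explicit argument mainly matters on older versions.

```r
married_chr <- c("Yes", "Yes", NA, "NO", "Yes")
married_fct <- factor(married_chr)

# Character vector: the NA is replaced as intended
married_chr[is.na(married_chr)] <- "NotAvailable"

# Factor: "NotAvailable" is not an existing level, so the element stays NA.
# R would warn about an invalid factor level; the warning is suppressed here.
fct_direct <- married_fct
suppressWarnings(fct_direct[is.na(fct_direct)] <- "NotAvailable")

# One workaround: add the new level first, then assign
fct_fixed <- married_fct
levels(fct_fixed) <- c(levels(fct_fixed), "NotAvailable")
fct_fixed[is.na(fct_fixed)] <- "NotAvailable"
```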