Recode a range of multiple columns in R
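I can't find an answer to this specific question. I want to recode multiple character columns as numeric columns (about a hundred of them), but:
- The columns will not always be in the same order (I am recoding data that is refreshed every month).
- The columns are separated by columns that I do not want to recode.
- The dataset will not always contain the same columns.
So I don't think I can use a range of column indices. However, the columns I want to recode all start with the same column-name prefix. I want to recode every "Yes" to 1, every "No" to 0, and blanks to NA.
I can do this manually, one column at a time, with the following code: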
#Recode columns one at a time
library(car)
#skip ID column
#Skip Date column
df$Q1<-as.numeric(as.character(recode(df$Q1,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
df$Q2<-as.numeric(as.character(recode(df$Q2,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
#skip Q2.Explanation column
#do the above for a hundred more columns...
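But I want to recode a hundred specific columns at once. Also, those columns are separated by columns that I do not want to recode.
My data looks like this (I'm not sure what dput is):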
ID<-c(01,02,03,04,05)
Q1<-c("Yes", NA,"", "No",NA)
Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
Q2<-c("No","Yes","Yes","", NA)
Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
Q3<-c("", NA, "Yes", NA, NA)
Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))
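If you know that the columns you want to change always have the same names and only their positions in the table differ, you can subset on the column names with a regular expression and then use apply() to change those columns.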
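library(car)
# Match only Q1, Q2, ...; a bare "Q" would also match the Q*.Explanation columns
q_cols <- grep("^Q\\d+$", colnames(your_data))
your_data[, q_cols] <- as.data.frame(apply(your_data[, q_cols, drop = FALSE],
2,
function(x) recode(x, "NA = NA; 'No' = 0; 'Yes' = 1; '' = NA")))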
This should recode the question columns, whose names start with "Q", regardless of where they sit in any given month.
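The same idea can be sketched in base R without the car package. This is only an illustration under some assumptions: the helper name `recode_yes_no` and its default pattern are hypothetical (not from the answer above), and the question columns are assumed to be named "Q" followed by digits. It selects columns by name and maps values through a named lookup vector, so anything other than "Yes" or "No" (blanks, NA) comes out as NA.

```r
# Hypothetical helper (not part of the original answer): recode Yes/No
# columns whose names match a pattern, leaving all other columns untouched.
recode_yes_no <- function(data, pattern = "^Q\\d+$") {
  # Select by name, so column order and absent columns do not matter
  q_cols <- grep(pattern, names(data), value = TRUE)
  if (length(q_cols) == 0) return(data)
  # Named lookup vector: values not listed here ("" or NA) map to NA
  lookup <- c(Yes = 1, No = 0)
  data[q_cols] <- lapply(data[q_cols], function(x) unname(lookup[as.character(x)]))
  data
}

# Sample data modelled on the question
Mydata <- data.frame(
  ID = 1:5,
  Q1 = c("Yes", NA, "", "No", NA),
  Q1.Explanation = c(NA, NA, "", "Respondent did not get the correct answer", NA),
  Q2 = c("No", "Yes", "Yes", "", NA),
  stringsAsFactors = FALSE
)
recoded <- recode_yes_no(Mydata)
recoded$Q1
```

Because the columns are picked by pattern each time the function runs, the same call should work on next month's file even if columns have moved or some are missing.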
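For data.table fans, I have another solution. Its advantage is that it recodes with factors instead of integers, so the meaning of each value is still displayed (which makes the data more readable):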
library(data.table)
ID<-c(01,02,03,04,05)
Q1<-c("Yes", NA,"", "No",NA)
Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
Q2<-c("No","Yes","Yes","", NA)
Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
Q3<-c("", NA, "Yes", NA, NA)
Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))
Mydata
# The solution starts here... ----------------------------------------------
setDT(Mydata) # convert data.frame into data.table
# the regular expression selects all column names starting with a "Q" followed by digits until the end
affected.cols <- colnames(Mydata)[grep("^Q\\d+$", colnames(Mydata))]
# convert the columns to factors; trailing square brackets are only added to print the output
Mydata[, (affected.cols) := lapply(.SD, factor, levels = c("No", "Yes")), .SDcols = affected.cols][]
str(Mydata) # Columns are encoded as factors ("enumerated types") now, which is an integer internally that has a string label
# Proof: 1 = "No", 2 = "Yes"; because only "No" and "Yes" are passed as levels to factor(), all other values (mainly empty strings) are translated into NAs
as.numeric(Mydata$Q1)
This results in:
> as.numeric(Mydata$Q1)
[1] 2 NA NA 1 NA
> Mydata
   ID  Q1                            Q1.Explanation  Q2                  Q2.Explanation  Q3
1:  1 Yes                                        NA  No The right answer was not proven  NA
2:  2  NA                                        NA Yes                              NA  NA
3:  3  NA                                           Yes                              NA Yes
4:  4  No Respondent did not get the correct answer  NA                              NA  NA
5:  5  NA                                        NA  NA                              NA  NA
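The numeric values come from the order of the levels: "No" gets level index 1 and "Yes" gets level index 2. Note that these are offset by one from the 0/1 coding requested in the question, so subtract 1 from the result of as.numeric() if you need exactly 0 and 1.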
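As a small illustration of how the level indices relate to the 0/1 coding asked for in the question, here is a sketch in base R. The variable names `q1` and `q1_numeric` are just for this example.

```r
# A factor built with levels c("No", "Yes") stores "No" as 1 and "Yes" as 2;
# values outside the levels ("" here) and NA both become NA
q1 <- factor(c("Yes", NA, "", "No", NA), levels = c("No", "Yes"))

# Shift the internal integer codes down by one to get No = 0, Yes = 1
q1_numeric <- as.integer(q1) - 1L
q1_numeric
```

This keeps the readable factor available for display while giving a plain integer vector for anything that needs the numeric coding.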