Recode a range of multiple columns in R

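I can't find an answer to this specific question. I want to recode multiple character columns (about a hundred of them) into numeric columns.

I don't think I can use a range of column indices. However, the columns I want to recode all start with the same column-name prefix. I want to recode any "Yes" to 1, any "No" to 0, and blanks to NA.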

I can do this manually, one column at a time, with the following code:

    #Recode columns one at a time

    library(car)
    #skip ID column
    #Skip Date column
    df$Q1<-as.numeric(as.character(recode(df$Q1,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
    df$Q2<-as.numeric(as.character(recode(df$Q2,"NA=NA; 'No'=0; 'Yes'=1; ''=NA")))
    #skip Q2.Explanation column
    #do the above for a hundred more columns...

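But I want to recode a hundred specific columns at once. In addition, these columns are separated by columns that I don't want to recode.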

My data looks like the following. I'm not sure what dput is:

    ID<-c(01,02,03,04,05)
    Q1<-c("Yes", NA,"", "No",NA)
    Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
    Q2<-c("No","Yes","Yes","", NA)
    Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
    Q3<-c("", NA, "Yes", NA, NA)
    Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))

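If you know that the columns you want to change always have the same names and only sit at different positions in the table, you can subset on the column names with a regular expression and then make the change with apply().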

    library(car)

    # Match only Q followed by digits, so the Q*.Explanation columns are left alone
    q_cols <- grep("^Q\\d+$", colnames(your_data))
    your_data[, q_cols] <- as.data.frame(apply(your_data[, q_cols], 2,
        function(x) recode(x, "NA = NA; 'No' = 0; 'Yes' = 1; '' = NA")))

This should recode all of the "Q" columns, regardless of where they are positioned in any given month.
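If you would rather not depend on car, the same idea can be sketched in base R with a named lookup vector. This is only an illustration: `codes` and `q_cols` are names made up here, the data is rebuilt with data.frame() instead of cbind(), and the pattern assumes the columns to recode are named "Q" followed only by digits.

```r
# Sample data from the question, built with data.frame() so column types are kept
Mydata <- data.frame(
  ID = 1:5,
  Q1 = c("Yes", NA, "", "No", NA),
  Q1.Explanation = c(NA, NA, "", "Respondent did not get the correct answer", NA),
  Q2 = c("No", "Yes", "Yes", "", NA),
  Q2.Explanation = c("The right answer was not proven", NA, NA, NA, NA),
  Q3 = c("", NA, "Yes", NA, NA),
  stringsAsFactors = FALSE
)

# Lookup vector: names are the old values, elements are the new codes
codes <- c(No = 0, Yes = 1)

# Names of the columns that are "Q" followed only by digits
# (this skips ID and the Q*.Explanation columns)
q_cols <- grep("^Q[0-9]+$", names(Mydata), value = TRUE)

# Index the lookup by each column's values; anything that is not
# "No" or "Yes" (empty strings, NA) has no match and becomes NA
Mydata[q_cols] <- lapply(Mydata[q_cols], function(x) unname(codes[as.character(x)]))

str(Mydata[q_cols])
```

Because the columns are picked by name rather than by index, it does not matter where they sit in the table or which other columns lie between them.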

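For data.table fans I have another solution. Its advantage is that it recodes to factors instead of integers, so the meaning of the values is still displayed (which makes the data easier to read):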

    library(data.table)

    ID<-c(01,02,03,04,05)
    Q1<-c("Yes", NA,"", "No",NA)
    Q1.Explanation<-c (NA, NA,"","Respondent did not get the correct answer", NA)
    Q2<-c("No","Yes","Yes","", NA)
    Q2.Explanation <-c("The right answer was not proven", NA, NA, NA, NA)
    Q3<-c("", NA, "Yes", NA, NA)
    Mydata<-as.data.frame(cbind(ID,Q1,Q1.Explanation, Q2, Q2.Explanation,Q3))

    Mydata

    # The solution starts here... ----------------------------------------------

    setDT(Mydata)     # convert data.frame into data.table (by reference)

    # the regular expression selects all column names starting with a "Q" followed by digits until the end
    affected.cols <- grep("^Q\\d+$", colnames(Mydata), value = TRUE)

    # convert the columns to factors; trailing square brackets are only added to print the output
    Mydata[, (affected.cols) := lapply(.SD, factor, levels = c("No", "Yes")), .SDcols = affected.cols][]

    str(Mydata)           # Columns are encoded as factors ("enumerated types") now, which is an integer internally that has a string label

    # Proof: 1 = "No", 2 = "Yes"; values not listed in the "levels" argument of factor() (mainly empty strings) are translated into NAs
    as.numeric(Mydata$Q1)

This results in:

    > as.numeric(Mydata$Q1)
    [1]  2 NA NA  1 NA

    > Mydata
       ID  Q1                            Q1.Explanation  Q2                  Q2.Explanation  Q3
    1:  1 Yes                                        NA  No The right answer was not proven  NA
    2:  2  NA                                        NA Yes                              NA  NA
    3:  3  NA                                           Yes                              NA Yes
    4:  4  No Respondent did not get the correct answer  NA                              NA  NA
    5:  5  NA                                        NA  NA                              NA  NA

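The translation to numeric values follows from the order of the factor levels: level indices start at 1, so "No" gets level index 1 and "Yes" gets level index 2. Note that this is 1/2 rather than the 0/1 requested in the question.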
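If the 0/1 coding from the question is needed after all, one option is to shift the integer codes of the factor down by one. A minimal sketch on a single vector (the names `q1` and `q1_num` are only illustrative):

```r
# Same values as Q1 in the question; anything not in `levels` becomes NA
q1 <- factor(c("Yes", NA, "", "No", NA), levels = c("No", "Yes"))

as.integer(q1)                 # 2 NA NA 1 NA  (level indices start at 1)

q1_num <- as.integer(q1) - 1L  # shift so that "No" = 0 and "Yes" = 1
q1_num                         # 1 NA NA 0 NA
```

This relies on the levels being given in the order c("No", "Yes"); with a different order the codes would come out differently.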