Caret rfe() error "there should be the same number of samples in x and y"

Question

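I'm having trouble resolving the error "there should be the same number of samples in x and y". I've seen other people post about this error on this site, but their solutions did not work for me. I'm attaching an abbreviated version of my dataset here.

Here is x_train: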

x_train <- structure(list(laterality = c("Left", "Right", "Right", "Right", 
"Left", "Left", "Left", "Left", "Left", "Right"), age = c(66L, 
56L, 69L, 49L, 60L, 70L, 58L, 53L, 59L, 64L), insurance = c("MEDICARE", 
"UNITED", "MEDICARE", "UNITED", "COMMERCIAL", "MEDICARE", "AETNA", 
"AETNA", "OXFORD", "MEDICARE_MANAGED"), employment = c("Retired", 
"FullTime", "Retired", "FullTime", "Disabled", "SelfEmployed", 
"Retired", "FullTime", "FullTime", "Disabled"), sex = c("Female", 
"Male", "Female", "Female", "Female", "Female", "Male", "Male", 
"Female", "Male"), race = c("WhiteorCaucasian", "WhiteorCaucasian", 
"WhiteorCaucasian", "WhiteorCaucasian", "WhiteorCaucasian", "WhiteorCaucasian", 
"Other", "BlackorAfricanAmerican", "WhiteorCaucasian", "WhiteorCaucasian"
), ethnicity = c("NotHispanicorLatino", "NotHispanicorLatino", 
"NotHispanicorLatino", "NotHispanicorLatino", "NotHispanicorLatino", 
"NotHispanicorLatino", "NotHispanicorLatino", "NotHispanicorLatino", 
"NotHispanicorLatino", "NotHispanicorLatino"), bmi = c(22.3, 
33, 34.3, 36, 30, 20, 29.5, 33.4, 26.5, 34.2), PreferredLanguage = c("English", 
"English", "English", "English", "English", "English", "English", 
"English", "English", "English"), married = c("Married", "Married", 
"Married", "Married", "Married", "Married", "Divorced", "Single", 
"Married", "Married"), RadiographSevere = c("No", "No", "No", 
"No", "No", "No", "No", "No", "No", "No"), HxAnxietyDepression = c("No", 
"No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"), SurgeryYear = c(2017L, 
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L
), operativetime = c(82L, 79L, 85L, 76L, 84L, 86L, 67L, 75L, 
72L, 100L), HipApproach = c("Anterior", "Posterior", "Posterior", 
"Posterior", "Posterior", "Anterior", "Posterior", "Posterior", 
"Posterior", "Posterior")), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

Here is y_train:

y_train <- structure(list(POD1AverageNrsScoreCut = c("[0,5)", "[0,5)", "[0,5)", 
                                          "[0,5)", "[5,10)", "[0,5)", "[0,5)", "[5,10)", "[0,5)", "[0,5)"
)), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))

Here is the code I use for rfe:

library(caret)
control <- rfeControl(functions = rfFuncs, # random forest
                      method = "repeatedcv", # repeated cv
                      repeats = 3, # number of repeats
                      number = 10) # number of folds

result_rfe <- rfe(x = x_train, y = y_train, sizes = c(1:30), rfeControl = control)

Answer 1

I see that your outcome is a set of interval labels with two classes. Maybe try passing it as a factor, y = as.factor(unlist(y_train))? That worked for me:

control <- rfeControl(functions = rfFuncs, # random forest
                      method = "repeatedcv", # repeated cv
                      repeats = 3, # number of repeats
                      number = 10) # number of folds

result_rfe <- rfe(x = x_train, y = as.factor(unlist(y_train)), sizes = c(1:30), rfeControl = control)

Output:

> result_rfe

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold, repeated 3 times) 

Resampling performance over subset size:

 Variables Accuracy Kappa AccuracySD KappaSD Selected
         1  0.06667     0     0.2537       0         
         2  0.06667     0     0.2537       0         
         3  0.30000     0     0.4661       0         
         4  0.20000     0     0.4068       0         
         5  0.36667     0     0.4901       0         
         6  0.40000     0     0.4983       0         
         7  0.43333     0     0.5040       0         
         8  0.53333     0     0.5074       0        *
         9  0.30000     0     0.4661       0         
        10  0.33333     0     0.4795       0         
        11  0.20000     0     0.4068       0         
        12  0.26667     0     0.4498       0         
        13  0.06667     0     0.2537       0         
        14  0.13333     0     0.3457       0         
        15  0.20000     0     0.4068       0         

The top 5 variables (out of 8):
   insurance, laterality, HipApproach, employment, ethnicity

Note: I don't know whether this is what you expected, since I don't know the context of your data or your approach.
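
A likely reason the conversion helps, sketched below in base R only: y_train is a one-column data frame, and length() of a data frame counts columns, not rows. This assumes rfe() compares nrow(x) with length(y), which is my reading of the error message and is not stated in the thread. The small x and y_df objects are hypothetical stand-ins for x_train and y_train.

```r
# Hypothetical stand-ins for x_train and y_train (plain data frames, base R only)
x <- data.frame(age = c(66L, 56L, 69L, 49L),
                bmi = c(22.3, 33, 34.3, 36))
y_df <- data.frame(POD1AverageNrsScoreCut = c("[0,5)", "[0,5)", "[5,10)", "[0,5)"))

nrow(x)       # number of samples in x
length(y_df)  # number of columns in y_df, because it is a data frame

# Flatten the one-column data frame into a factor, as the answer suggests
y <- as.factor(unlist(y_df, use.names = FALSE))

length(y)     # now the number of samples
levels(y)     # the two outcome classes

# Sanity check worth running before calling rfe(x = ..., y = ...)
stopifnot(nrow(x) == length(y))
```

Running the same stopifnot() check on the real x_train and the converted y before calling rfe() should show whether the two are aligned.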

Original answer: Subscript out of bounds error in caret's rfe function
