How to fix "Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent" in R (caret) with 'knnImpute'
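I am new to the caret package (and to machine learning with R and caret in general). I am using a publicly available dataset from Seattle, and I want to predict the class of future incoming requests (via classification).
First, I did an 80/20 split of my dataset. The data contains some NAs, which I would like to impute using caret's knnImpute. After running for a while, I get the following error message: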
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
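What am I doing wrong, and how can I fix it?
There are several posts about this error. Unfortunately, I did not find a solution that helped me solve my problem...
My dataset (v1.0) looks like this: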
> dataset %>% str()
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 170657 obs. of 9 variables:
$ request_type : Factor w/ 29 levels "Abandoned_Vehicle",..: 10 10 10 10 10 10 10 10 10 10 ...
$ city_department: Factor w/ 8 levels "Center","City_Light",..: 3 3 3 3 3 3 3 3 3 3 ...
$ neighborhood : Factor w/ 91 levels "Adams","Alki",..: 1 1 4 4 10 13 21 21 21 24 ...
$ weekday : Ord.factor w/ 7 levels "So"<"Mo"<"Di"<..: 5 2 2 5 1 3 6 4 4 2 ...
$ month : Ord.factor w/ 12 levels "Jän"<"Feb"<"Mär"<..: 4 6 1 3 4 3 2 4 7 5 ...
$ cal_week : num 15 23 2 10 17 10 6 16 29 21 ...
$ holiday : Factor w/ 2 levels "noholiday","holiday": 1 1 1 1 1 1 1 1 1 1 ...
$ businessday : Factor w/ 2 levels "businessday",..: 1 1 1 1 2 1 1 1 1 1 ...
$ goodfriday : Factor w/ 2 levels "nogoodfriday",..: 1 1 1 1 1 1 1 1 1 1 ...
> dataset %>% skim()
Skim summary statistics
n obs: 170657
n variables: 9
── Variable type:factor ───────────────────────────────────────────────────────────────────────────────────────
variable missing complete n n_unique top_counts ordered
businessday 0 170657 170657 2 bus: 136087, nob: 34570, NA: 0 FALSE
city_department 0 170657 170657 8 Pol: 54916, Pub: 38171, Dep: 34712, Fin: 25471 FALSE
goodfriday 0 170657 170657 2 nog: 170140, goo: 517, NA: 0 FALSE
holiday 0 170657 170657 2 noh: 167514, hol: 3143, NA: 0 FALSE
month 0 170657 170657 12 Aug: 15247, Okt: 14807, Sep: 14785, Mär: 14781 TRUE
neighborhood 6447 164210 170657 91 NA: 6447, Bro: 4975, Uni: 3941, Wal: 3919 FALSE
request_type 0 170657 170657 29 Aba: 34478, Cus: 22275, Ill: 22033, Par: 16521 FALSE
weekday 0 170657 170657 7 Di: 28972, Mi: 28734, Mo: 28721, Do: 27298 TRUE
── Variable type:numeric ──────────────────────────────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
cal_week 0 170657 170657 26.52 14.78 1 14 27 39 53 ▇▇▇▇▇▇▆▆
My code for splitting:
set.seed(100)
split <- createDataPartition(dataset$request_type, p=0.8, list=FALSE)
train <- dataset[split,]
train_x = train[, 2:8]
train_y = train$request_type
test <- dataset[-split,]
test_x = test[, 2:8]
test_y = test$request_type
My code for imputing:
model.preprocessed.imputed <- preProcess(train, method='knnImpute')
model.preprocessed.imputed
train <- predict(model.preprocessed.imputed, newdata = train)
When running predict, I get the error message
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
From the traceback I get the following information:
Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent
3. `colnames<-`(`*tmp*`, value = miss_names)
2. predict.preProcess(PreProcess.MissingDatamodel, newdata = train)
1. predict(PreProcess.MissingDatamodel, newdata = train)
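One thing worth checking at this point is where the NAs actually live. The preProcess documentation states that non-numeric data is not pre-processed, and the skim output above shows missing values only in the factor column neighborhood, so knnImpute may have nothing it can impute here. That is an assumption about the cause, not a confirmed diagnosis. A minimal sketch of the check, using a made-up toy data frame (the name toy and its values are hypothetical, not the Seattle data):

```r
# Toy stand-in for the training data (hypothetical values).
toy <- data.frame(
  neighborhood = factor(c("Adams", NA, "Alki", "Adams")),
  cal_week     = c(15, 23, 2, 10)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

# Which columns are numeric, i.e. the only ones preProcess would transform.
is_num <- vapply(toy, is.numeric, logical(1))

# Columns that contain NAs but are not numeric.
untouched <- names(toy)[na_counts > 0 & !is_num]
print(untouched)
```

Running the same three steps on train instead of toy would show whether any numeric column contains NAs at all.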
Update (April 2, 2019)
The first version of my dataset (v1.0) showed a mixed class:
> dataset %>% str()
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 170657 obs. of 9 variables:
Because I found some posts suggesting that caret may react strangely to tibbles, I tried converting my dataset to a plain data frame (v1.1):
dataset <- as.data.frame(dataset)
dataset %>% str()
'data.frame': 170657 obs. of 9 variables:
$ request_type : Factor w/ 29 levels "Abandoned.Vehicle",..: 10 10 10 10 10 10 10 10 10 10 ...
$ city_department: Factor w/ 8 levels "Center","City.Light",..: 3 3 3 3 3 3 3 3 3 3 ...
$ neighborhood : Factor w/ 91 levels "Adams","Alki",..: 1 1 4 4 10 13 21 21 21 24 ...
$ weekday : Ord.factor w/ 7 levels "So"<"Mo"<"Di"<..: 5 2 2 5 1 3 6 4 4 2 ...
$ month : Ord.factor w/ 12 levels "Jän"<"Feb"<"Mär"<..: 4 6 1 3 4 3 2 4 7 5 ...
$ cal_week : num 15 23 2 10 17 10 6 16 29 21 ...
$ holiday : Factor w/ 2 levels "noholiday","holiday": 1 1 1 1 1 1 1 1 1 1 ...
$ businessday : Factor w/ 2 levels "businessday",..: 1 1 1 1 2 1 1 1 1 1 ...
$ goodfriday : Factor w/ 2 levels "nogoodfriday",..: 1 1 1 1 1 1 1 1 1 1 ...
dataset %>% skim()
Skim summary statistics
n obs: 170657
n variables: 9
── Variable type:factor ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
variable missing complete n n_unique top_counts ordered
businessday 0 170657 170657 2 bus: 136087, nob: 34570, NA: 0 FALSE
city_department 0 170657 170657 8 Pol: 54916, Pub: 38171, Dep: 34712, Fin: 25471 FALSE
goodfriday 0 170657 170657 2 nog: 170140, goo: 517, NA: 0 FALSE
holiday 0 170657 170657 2 noh: 167514, hol: 3143, NA: 0 FALSE
month 0 170657 170657 12 Aug: 15247, Okt: 14807, Sep: 14785, Mär: 14781 TRUE
neighborhood 6447 164210 170657 91 NA: 6447, Bro: 4975, Uni: 3941, Wal: 3919 FALSE
request_type 0 170657 170657 29 Aba: 34478, Cus: 22275, Ill: 22033, Par: 16521 FALSE
weekday 0 170657 170657 7 Di: 28972, Mi: 28734, Mo: 28721, Do: 27298 TRUE
── Variable type:numeric ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
variable missing complete n mean sd p0 p25 p50 p75 p100 hist
cal_week 0 170657 170657 26.52 14.78 1 14 27 39 53 ▇▇▇▇▇▇▆▆
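Although the object is now only of class data.frame, this did not solve my problem.
I think I found the root of the problem:
I originally used the tidyverse's readr::read_csv(), which somehow gave me a data object with strange behavior (the comments also pointed out this misuse; thanks for your input):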
dataset <- read_csv("data/DataSet.csv") %>% clean_names()
After switching to read.csv(), there were no more NAs in my dataset, and all of caret's functions suddenly worked on my data:
dataset <- read.csv("data/DataSet.csv", stringsAsFactors = FALSE) %>% clean_names()
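The difference between the two calls can be reproduced in base R alone. By default, read.csv() uses na.strings = "NA", so a blank field in a character column is kept as an empty string, whereas readr's read_csv() defaults to na = c("", "NA") and turns blank fields into NA. The sketch below imitates the readr behavior by passing na.strings explicitly; the inline CSV text is made up for illustration and is not the actual DataSet.csv:

```r
# Hypothetical three-row CSV with one blank neighborhood field.
csv_text <- "neighborhood,cal_week\nAdams,15\n,23\nAlki,2\n"

# Base R default: the blank character field is read as "".
base_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)
n_na_default <- sum(is.na(base_default$neighborhood))
n_blank_default <- sum(base_default$neighborhood == "")

# Treat blank fields as NA, similar to readr's default behavior.
base_na <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                    na.strings = c("", "NA"))
n_na_explicit <- sum(is.na(base_na$neighborhood))

print(c(default = n_na_default, explicit = n_na_explicit))
```

So the NAs do not disappear with read.csv(); they are hidden as empty strings unless na.strings says otherwise.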
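Maybe this finding helps others too, as I wasted a lot of time hunting for the cause of an error message that came from a faulty dataset object.
Update
Now I know why there are no NAs anymore. I found that read.csv() reads the missing values but turns them into empty strings (""), whereas read_csv() explicitly makes them NA. I also simply converted the NAs into a factor level ("missing"), so I do not have to delete data and risk losing information.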
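The recoding of NAs into their own factor level could look roughly like the following base R sketch. The vector neighborhood and its values are made up for illustration, and the level name "missing" is the one mentioned above; this is one possible way to do it, not necessarily the exact code I used:

```r
# Hypothetical factor with missing values.
neighborhood <- factor(c("Adams", NA, "Alki", NA))

# Go through character so the NAs can be replaced by a plain label.
as_chr <- as.character(neighborhood)
as_chr[is.na(as_chr)] <- "missing"

# Rebuild the factor, keeping the original level order and appending "missing".
neighborhood_filled <- factor(as_chr,
                              levels = c(levels(neighborhood), "missing"))

print(table(neighborhood_filled))
```

With the missing values encoded as a regular level, no rows need to be dropped and no imputation step is required for this column.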