One-hot encoding creating n-1 dummy variables
To one-hot encode the factor variables in my data set, I am using the great function by user "Ben" from this post:
one_hot <- function(dt, cols = "auto", dropCols = TRUE, dropUnusedLevels = FALSE) {
  # One-hot encode unordered factors in a data.table.
  # If cols = "auto", each unordered factor column in dt is encoded
  # (or specify a vector of column names to encode).
  # If dropCols = TRUE, the original factor columns are dropped.
  # If dropUnusedLevels = TRUE, unused factor levels are dropped.

  # Automatically get the unordered factor columns
  if (cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) && !is.ordered(x)))]

  # Build tempDT containing an ID column and the 'cols' columns
  tempDT <- dt[, cols, with = FALSE]
  tempDT[, ID := .I]
  setcolorder(tempDT, unique(c("ID", colnames(tempDT))))
  # Prefix every level with its column name so the new column names are unique
  for (col in cols) set(tempDT, j = col, value = factor(paste(col, tempDT[[col]], sep = "_"), levels = paste(col, levels(tempDT[[col]]), sep = "_")))

  # One-hot encode
  if (dropUnusedLevels == TRUE) {
    newCols <- dcast(melt(tempDT, id = "ID", value.factor = TRUE), ID ~ value, drop = TRUE, fun.aggregate = length)
  } else {
    newCols <- dcast(melt(tempDT, id = "ID", value.factor = TRUE), ID ~ value, drop = FALSE, fun.aggregate = length)
  }

  # Combine the binarized columns with the original data set
  result <- cbind(dt, newCols[, !"ID"])

  # If dropCols = TRUE, remove the original factor columns
  if (dropCols == TRUE) {
    result <- result[, !cols, with = FALSE]
  }
  return(result)
}
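This function creates n dummy variables for all n factor levels of each factor column. But since I want to use the data for modeling, I only need n-1 dummy variables per factor column. Is that possible? If so, how can I do it with this function?
From my point of view, this line has to be adjusted: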
newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)
Here is the input table...
   ID color   size
1:  1 black  large
2:  2 green medium
3:  3   red  small
library(data.table)
DT = setDT(structure(list(ID = 1:3, color = c("black", "green", "red"),
size = c("large", "medium", "small")), .Names = c("ID", "color",
"size"), row.names = c(NA, -3L), class = "data.frame"))
...and the desired output table:
ID color.black color.green size.large size.medium
 1           1           0          1           0
 2           0           1          0           1
 3           0           0          0           0
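Here is a solution that does full-rank dummification (i.e. creates n-1 columns per factor to avoid collinearity):
library(caret)
data.table(ID = DT$ID, predict(dummyVars(ID ~ ., DT, fullRank = TRUE), DT))
This gives the required n-1 columns per factor (the first level of each factor, here black and large, becomes the reference level that gets no column):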
   ID colorgreen colorred sizemedium sizesmall
1:  1          0        0          0         0
2:  2          1        0          1         0
3:  3          0        1          0         1
See this for a friendly walkthrough of this function, and ?dummyVars for all available options.
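If installing caret is not an option, the same n-1 encoding can be sketched in base R with model.matrix, which applies treatment contrasts by default and therefore also drops the first level of each factor. This is an alternative to the dummyVars call, not what the answer itself uses; the names `DF`, `mm`, `dummies` and `result` are only illustrative.

```r
# Same toy data as in the question, as a plain data.frame
DF <- data.frame(
  ID    = 1:3,
  color = c("black", "green", "red"),
  size  = c("large", "medium", "small")
)

# model.matrix() turns character/factor columns into treatment-coded dummies:
# one column per level except the first (reference) level of each variable
mm <- model.matrix(~ color + size, data = DF)

# Drop the intercept column; what remains are the n-1 dummies per variable
dummies <- mm[, -1, drop = FALSE]

result <- cbind(ID = DF$ID, as.data.frame(dummies))
print(result)
```

The intercept column plays the role of the dropped reference levels, which is why keeping all n dummies next to an intercept would make the design matrix rank-deficient.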
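Also: in the comments, the OP mentions needing to do this for millions of rows and thousands of columns, which is the justification for using data.table. If this simple preprocessing step is too much for the "computing muscle", then I am afraid the modeling step (a.k.a. the real deal) is doomed to fail.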
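For readers who do want to stay within data.table, here is a minimal sketch of an n-1 encoder. It is not a modification of Ben's one_hot function and has not been benchmarked at the scale mentioned above; the function name `one_hot_drop_first` and its arguments are hypothetical. It assumes the first level of each column (in factor level order, or sorted order for character columns) is the reference level.

```r
library(data.table)

# Hypothetical helper: encode each column in `cols` as n-1 integer dummies,
# skipping the first level, and drop the original columns.
one_hot_drop_first <- function(dt, cols) {
  out <- copy(dt)                      # leave the caller's table untouched
  new_cols <- list()
  for (col in cols) {
    x <- as.factor(out[[col]])         # character columns become factors
    for (lvl in levels(x)[-1L]) {      # [-1L] skips the reference level
      new_cols[[paste(col, lvl, sep = "_")]] <- as.integer(x == lvl)
    }
  }
  if (length(new_cols) > 0L) out[, (names(new_cols)) := new_cols]
  out[, (cols) := NULL]
  out[]
}

DT <- data.table(
  ID    = 1:3,
  color = c("black", "green", "red"),
  size  = c("large", "medium", "small")
)
res <- one_hot_drop_first(DT, c("color", "size"))
print(res)
```

To keep the first levels and drop the last one instead, as in the desired output table of the question, `levels(x)[-1L]` could be replaced with `head(levels(x), -1L)`.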