One-hot encoding for a set of columns in R
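I am trying to one-hot encode a subset of the columns of a df in R.
One-hot encoding is a process that converts categorical variables into a form that can be fed to ML algorithms so they predict better: a string column is turned into one binary column for each distinct string in that column.
Suppose we have a df that looks like this: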
mes work_location birth_place
01/01/2000 China Chile
01/02/2000 Mexico Japan
01/03/2000 China Chile
01/04/2000 China Argentina
01/05/2000 USA Poland
01/06/2000 Mexico Poland
01/07/2000 USA Finland
01/08/2000 USA Finland
01/09/2000 Japan Norway
01/10/2000 Japan Kenia
01/11/2000 Japan Mali
01/12/2000 India Mali
This is the code for the hot encoding:
## function to hot-encode ##
columna_dummy <- function(df, columna) {
  df %>%
    mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
    mutate(valor = 1) %>%
    spread(key = columna, value = valor, fill = 0)
}
## selecting columns ##
columnas <- c("work_location", "birth_place")
## applying loop to repeat columna_dummy function for each df column ##
for(i in 1:length(columnas)){
  new_dataset <- columna_dummy(df, i)
}
Console output:
Error: Problem with `mutate()` input `mes`.
x objeto '1' no encontrado
i Input `mes` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd)
The column mes is a date-class column, but it is not included in the columnas atomic vector, and the error above is raised anyway.
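For every string in the selected string columns of df, the expected output should look something like this:
(I can't add every column, but work_location_China is an example of what the columns would look like)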
mes work_location birth_place work_location_China
01/01/2000 China Chile 1
01/02/2000 Mexico Japan 0
01/03/2000 China Chile 1
01/04/2000 China Argentina 1
01/05/2000 USA Poland 0
01/06/2000 Mexico Poland 0
01/07/2000 USA Finland 0
01/08/2000 USA Finland 0
01/09/2000 Japan Norway 0
01/10/2000 Japan Kenia 0
01/11/2000 Japan Mali 0
01/12/2000 India Mali 0
Is there any other way to apply this loop?
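As we are passing strings, one option is to select the column (select accepts both quoted and unquoted names), create a column of 1s ('valor') and a row number column ('rn'), and then reshape from 'long' to 'wide' with pivot_wider.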
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
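columna_dummy <- function(df, columna) {
  df %>%
    # all_of() makes it explicit that columna holds a column name as a string
    select(all_of(columna)) %>%
    # valor is the indicator value; rn keeps one output row per input row
    mutate(valor = 1, rn = row_number()) %>%
    pivot_wider(names_from = all_of(columna),
                values_from = valor, values_fill = 0) %>%
    select(-rn)
}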
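To see what this helper returns for a single column, here is a minimal sketch on a three-row toy data frame. The `toy` object is made up for illustration and is not part of the question's data. The `rn` column matters here: it is the only id column left when pivoting, so each original row stays on its own row instead of the rows being collapsed.

```r
library(dplyr)
library(tidyr)

# Same helper as above (with all_of() in select()), repeated so the sketch runs on its own
columna_dummy <- function(df, columna) {
  df %>%
    select(all_of(columna)) %>%
    mutate(valor = 1, rn = row_number()) %>%
    pivot_wider(names_from = all_of(columna),
                values_from = valor, values_fill = 0) %>%
    select(-rn)
}

# Hypothetical toy input, not the question's df
toy <- data.frame(work_location = c("China", "Mexico", "China"))

res <- columna_dummy(toy, "work_location")
print(res)
```

The result has one indicator column per distinct value (`China`, `Mexico`), named after the values themselves, which is why a prefixing step is still needed when several columns are encoded.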
-testing
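For multiple columns, one option is to loop over the column names of interest with a map variant (imap_dfc in the code below), apply the function, column-bind the results (the _dfc suffix), and bind them to the original dataset with bind_cols.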
out <- imap_dfc(setNames(c("work_location", "birth_place"),
c("work_location", "birth_place")) , ~ {
nm1 <- as.character(.y)
columna_dummy(df = df, columna = .x) %>%
rename_all(~ str_c(nm1, ., sep="_"))
}) %>%
bind_cols(df, .)
-output
head(out, 2)
# mes work_location birth_place work_location_China work_location_Mexico work_location_USA work_location_Japan
#1 01/01/2000 China Chile 1 0 0 0
#2 01/02/2000 Mexico Japan 0 1 0 0
# work_location_India birth_place_Chile birth_place_Japan birth_place_Argentina birth_place_Poland birth_place_Finland
#1 0 1 0 0 0 0
#2 0 0 1 0 0 0
# birth_place_Norway birth_place_Kenia birth_place_Mali
#1 0 0 0
#2 0 0 0
data
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
I solved this by using the purrr library:
## data ##
df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000",
"01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000",
"01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China",
"Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan",
"Japan", "Japan", "India"), birth_place = c("Chile", "Japan",
"Chile", "Argentina", "Poland", "Poland", "Finland", "Finland",
"Norway", "Kenia", "Mali", "Mali")), class = "data.frame",
row.names = c(NA,
-12L))
library(dplyr)  # %>%, mutate_at(), mutate(), inner_join()
library(tidyr)  # spread()
library(purrr)  # map(), reduce()

## function to hot-encode ##
columna_dummy <- function(df, columna) {
  df %>%
    mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>%
    mutate(valor = 1) %>%
    spread(key = columna, value = valor, fill = 0)
}
## vector of columns ##
columnas <- c("work_location", "birth_place")
## hot_encoded_dataset ##
hot_encoded_dataset <- purrr::map(columnas, columna_dummy, df = df) %>%
  reduce(inner_join)
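The original `for` loop can also be made to work. It passed the index `i` instead of the column name `columnas[i]`: with `i = 1`, `mutate_at(1, ...)` selects the first column by position (`mes`), and `as.symbol(1)` produces a symbol named `1`, which is what "objeto '1' no encontrado" (object '1' not found) refers to. The loop also overwrote `new_dataset` on every iteration. Below is a minimal sketch of a corrected loop; it uses a pivot_wider-based helper like the one in the other answer, adds `names_prefix` for the column prefix (my addition, not in either answer), and runs on a shortened, four-row stand-in for `df`.

```r
library(dplyr)
library(tidyr)

# Shortened stand-in for the question's df (first four rows only)
df <- data.frame(
  mes = c("01/01/2000", "01/02/2000", "01/03/2000", "01/04/2000"),
  work_location = c("China", "Mexico", "China", "China"),
  birth_place = c("Chile", "Japan", "Chile", "Argentina")
)

# Variant of the pivot_wider helper; names_prefix is an addition for this sketch
columna_dummy <- function(df, columna) {
  df %>%
    select(all_of(columna)) %>%
    mutate(valor = 1, rn = row_number()) %>%
    pivot_wider(names_from = all_of(columna), values_from = valor,
                values_fill = 0, names_prefix = paste0(columna, "_")) %>%
    select(-rn)
}

columnas <- c("work_location", "birth_place")

# Start from the original data and append the indicator columns of each
# selected column, iterating over the names rather than over 1:length(columnas)
new_dataset <- df
for (columna in columnas) {
  new_dataset <- bind_cols(new_dataset, columna_dummy(df, columna))
}

print(new_dataset)
```

Iterating over the names directly avoids the index mix-up, and accumulating into `new_dataset` keeps the indicator columns from every iteration rather than only the last one.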