Dummyfication of a column/variable
I am designing a neural network in R. For that I have to prepare my data, and I have imported a table.
For example:
time hour Money day
1: 20000616 1 9.35 5
2: 20000616 2 6.22 5
3: 20000616 3 10.65 5
4: 20000616 4 11.42 5
5: 20000616 5 10.12 5
6: 20000616 6 7.32 5
Now I need a dummyfication of the hour column. My final table should look like this:
time Money day 1 2 3 4 5 6
1: 20000616 9.35 5 1 0 0 0 0 0
2: 20000616 6.22 5 0 1 0 0 0 0
3: 20000616 10.65 5 0 0 1 0 0 0
4: 20000616 11.42 5 0 0 0 1 0 0
5: 20000616 10.12 5 0 0 0 0 1 0
6: 20000616 7.32 5 0 0 0 0 0 1
Is there an easy or smart way to convert my table to the new layout, programmatically in R? I need to do this in R, not before the import.
Thanks in advance
A possible solution with data.table (which you are apparently using):
dt[dcast(dt, hour ~ hour, value.var = 'hour', fun = length), on = .(hour)]
which gives:
time hour Money day 1 2 3 4 5 6
1: 20000616 1 9.35 5 1 0 0 0 0 0
2: 20000616 2 6.22 5 0 1 0 0 0 0
3: 20000616 3 10.65 5 0 0 1 0 0 0
4: 20000616 4 11.42 5 0 0 0 1 0 0
5: 20000616 5 10.12 5 0 0 0 0 1 0
6: 20000616 6 7.32 5 0 0 0 0 0 1
I suppose that in your real dataset time and day will vary more, in which case you can adapt the code to:
dt[dcast(dt, time + day + hour ~ hour, value.var = 'hour', fun = length)
, on = .(time, day, hour)]
Used data:
library(data.table)
dt <- fread(' time hour Money day
20000616 1 9.35 5
20000616 2 6.22 5
20000616 3 10.65 5
20000616 4 11.42 5
20000616 5 10.12 5
20000616 6 7.32 5')
You should break your goal into smaller, doable problems.
- Create a matrix of 0's
- Fill the diagonal with 1's
- Add the matrix to your original data
# 0. Create data
df <- mtcars[1:6, 1:4]
mpg cyl disp hp
Mazda RX4 21.0 6 160 110
Mazda RX4 Wag 21.0 6 160 110
Datsun 710 22.8 4 108 93
Hornet 4 Drive 21.4 6 258 110
Hornet Sportabout 18.7 8 360 175
Valiant 18.1 6 225 105
# 1. Create matrix of 0's
foo <- matrix(rep(0, nrow(df) ^ 2), nrow(df))
# 2. Fill diagonal
diag(foo) <- 1
# 3. Combine with original data
cbind(df, foo)
mpg cyl disp hp 1 2 3 4 5 6
Mazda RX4 21.0 6 160 110 1 0 0 0 0 0
Mazda RX4 Wag 21.0 6 160 110 0 1 0 0 0 0
Datsun 710 22.8 4 108 93 0 0 1 0 0 0
Hornet 4 Drive 21.4 6 258 110 0 0 0 1 0 0
Hornet Sportabout 18.7 8 360 175 0 0 0 0 1 0
Valiant 18.1 6 225 105 0 0 0 0 0 1
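Filling the diagonal only gives the right answer when row i belongs to category i, as in the question, where hour runs from 1 to 6 in row order. A minimal base R sketch of the same three steps for a column whose values repeat or are unordered is shown below. The small `dat` data frame and the names `lev`, `idx`, and `onehot` are made up for illustration.

```r
# Example data: hour repeats and is not in row order (assumed for illustration)
dat <- data.frame(hour  = c(3, 1, 2, 3, 1),
                  Money = c(10.65, 9.35, 6.22, 11.42, 10.12))

# 1. Create a matrix of 0's: one row per observation, one column per level
lev    <- sort(unique(dat$hour))
onehot <- matrix(0, nrow = nrow(dat), ncol = length(lev),
                 dimnames = list(NULL, as.character(lev)))

# 2. Instead of the diagonal, set cell (i, position of hour[i] in lev) to 1
idx <- cbind(seq_len(nrow(dat)), match(dat$hour, lev))
onehot[idx] <- 1

# 3. Combine with the original data
result <- cbind(dat, onehot)
result
```

Indexing a matrix with a two-column matrix of (row, column) pairs sets exactly one cell per row, so every row ends up with a single 1.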
A base R solution is as follows:
dat <- data.frame(time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616),
hour = c(1, 2, 3, 4, 5, 6),
Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
day = c(5, 5, 5, 5, 5, 5) )
dat$dummy_day <- factor(dat$day, levels = 1:7)
model.matrix(~time + hour + Money + day + dummy_day, dat,
contrasts = list(dummy_day = "contr.SAS"))
It returns a matrix:
(Intercept) time hour Money day dummy_day1 dummy_day2 dummy_day3 dummy_day4 dummy_day5 dummy_day6
1 1 20000616 1 9.35 5 0 0 0 0 1 0
2 1 20000616 2 6.22 5 0 0 0 0 1 0
3 1 20000616 3 10.65 5 0 0 0 0 1 0
4 1 20000616 4 11.42 5 0 0 0 0 1 0
5 1 20000616 5 10.12 5 0 0 0 0 1 0
6 1 20000616 6 7.32 5 0 0 0 0 1 0
attr(,"assign")
[1] 0 1 2 3 4 5 5 5 5 5 5
attr(,"contrasts")
attr(,"contrasts")$dummy_day
[1] "contr.SAS"
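The call above expands day. The same idea applied to hour, the column the question asks about, might look like the sketch below. It rebuilds `dat` so that it runs on its own. The names `hour_dummies` and `out` are made up for illustration. Removing the intercept with `- 1` makes `model.matrix` keep an indicator column for every level instead of dropping a reference level.

```r
dat <- data.frame(time  = rep(20000616, 6),
                  hour  = c(1, 2, 3, 4, 5, 6),
                  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
                  day   = rep(5, 6))

# Treat hour as categorical; "- 1" removes the intercept so all levels get a column
hour_dummies <- model.matrix(~ factor(hour) - 1, data = dat)

# Rename "factor(hour)1" ... "factor(hour)6" to plain "1" ... "6"
colnames(hour_dummies) <- levels(factor(dat$hour))

# Drop the original hour column and append the indicator columns
out <- cbind(dat[, c("time", "Money", "day")], hour_dummies)
out
```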
You can easily create dummy variables with the dummies package.
library(dummies)
df <- data.frame(
time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616),
hour = c(1, 2, 3, 4, 5, 6),
Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
day = c(5, 5, 5, 5, 5, 5))
# Specify the categorical variables in the dummy.data.frame function.
df_dummy <- dummy.data.frame(df, names=c("hour"), sep="_")
names(df_dummy) <- c("time", 1:6, "Money", "day")
df_dummy <- df_dummy[c("time", "Money", "day", 1:6)]
df_dummy
# time Money day 1 2 3 4 5 6
# 1 20000616 9.35 5 1 0 0 0 0 0
# 2 20000616 6.22 5 0 1 0 0 0 0
# 3 20000616 10.65 5 0 0 1 0 0 0
# 4 20000616 11.42 5 0 0 0 1 0 0
# 5 20000616 10.12 5 0 0 0 0 1 0
# 6 20000616 7.32 5 0 0 0 0 0 1
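Some others have mentioned using model.matrix to get a design matrix. That is a good solution. But I find that I usually want to customize how missing values are handled or how rare levels are collapsed. So here is an alternative function that you can customize.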
```
one_hot_encode <- function(DT, cols_to_encode, include_last = TRUE
, protected_NA_val = 'NA_MISSING'
) {
for (col in cols_to_encode) {
level_freq <- DT[, sort(table(get(col), useNA = 'ifany')
, decreasing = TRUE)]
level_names <- names(level_freq)
level_names[is.na(level_names)] <- protected_NA_val
if (!include_last) {
level_names <- level_names[-length(level_names)]
}
for (lev in level_names) {
new_col_name <- paste('ONE_HOT', col, lev, sep = '_')
DT[, (new_col_name) := 0]
if (lev == protected_NA_val) {
DT[is.na(get(col)), (new_col_name) := 1]
} else {
DT[get(col) == lev, (new_col_name) := 1]
}
}
}
return(DT)
}
```
So applying this function to your dataset becomes:
```
library(data.table)

DT <- data.table(
time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616)
, hour = c(1, 2, 3, 4, 5, 6)
, money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32)
, day = c(5, 5, 5, 5, 5, 5)
)
DT <- one_hot_encode(DT, 'hour')
```
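The function above leaves rare levels as they are. One possible pre-processing step is sketched below in base R, on the assumption that levels seen fewer than `min_count` times should share a single label. `collapse_rare_levels`, `min_count`, and `other_label` are hypothetical names, not part of any package. It could be run on a column before encoding it.

```r
# Hypothetical helper: replace levels seen fewer than min_count times with one label
collapse_rare_levels <- function(x, min_count = 2, other_label = "OTHER") {
  x    <- as.character(x)
  freq <- table(x)                      # table() does not count NA by default
  rare <- names(freq)[freq < min_count]
  x[x %in% rare] <- other_label         # NA stays NA and can be handled separately
  x
}

hour <- c(1, 1, 2, 2, 2, 3, NA)
collapsed <- collapse_rare_levels(hour)
collapsed
```

Missing values are deliberately left untouched here, so a later encoding step can still give them their own column.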