Dummyfication of a column/variable

I am designing a neural network in R. For that I have to prepare my data, which I have imported into a table.

For example:

      time    hour Money day
1:  20000616    1  9.35   5
2:  20000616    2  6.22   5 
3:  20000616    3  10.65  5
4:  20000616    4  11.42  5
5:  20000616    5  10.12  5
6:  20000616    6  7.32   5

Now I need a dummyfication of the hour column. My final table should look like this:

      time    Money day  1   2   3   4   5   6   
1:  20000616  9.35   5   1   0   0   0   0   0
2:  20000616  6.22   5   0   1   0   0   0   0
3:  20000616  10.65  5   0   0   1   0   0   0
4:  20000616  11.42  5   0   0   0   1   0   0
5:  20000616  10.12  5   0   0   0   0   1   0
6:  20000616  7.32   5   0   0   0   0   0   1

Is there an easy/smart way to transform my table into the new layout, or to do it programmatically in R? I need to do this in R, not before the import.

Thanks in advance

A possible solution with data.table (which you are apparently using):

dt[dcast(dt, hour ~ hour, value.var = 'hour', fun.aggregate = length), on = .(hour)]

which gives:

       time hour Money day 1 2 3 4 5 6
1: 20000616    1  9.35   5 1 0 0 0 0 0
2: 20000616    2  6.22   5 0 1 0 0 0 0
3: 20000616    3 10.65   5 0 0 1 0 0 0
4: 20000616    4 11.42   5 0 0 0 1 0 0
5: 20000616    5 10.12   5 0 0 0 0 1 0
6: 20000616    6  7.32   5 0 0 0 0 0 1

I suppose that in your real dataset there will be more variety in time and day. You could then adapt the code to:

dt[dcast(dt, time + day + hour ~ hour, value.var = 'hour', fun.aggregate = length)
   , on = .(time, day, hour)]

Used data:

dt <- fread(' time    hour Money day
20000616    1  9.35   5
20000616    2  6.22   5 
20000616    3  10.65  5
20000616    4  11.42  5
20000616    5  10.12  5
20000616    6  7.32   5')
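
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its two steps: the `dcast` call builds a wide indicator table, and the join attaches it to the original rows. The names `wide` and `res` are only illustrative, and the data is rebuilt with `data.table()` rather than `fread()` so the block stands alone.

```r
library(data.table)

dt <- data.table(
  time  = rep(20000616, 6),
  hour  = 1:6,
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
  day   = rep(5, 6)
)

# Step 1: one row per hour, one count column per distinct hour value.
# Combinations that do not occur are filled with 0.
wide <- dcast(dt, hour ~ hour, value.var = "hour", fun.aggregate = length)

# Step 2: join the indicator columns back onto the original rows by hour.
res <- dt[wide, on = .(hour)]
res
```

Because `length` counts rows, a cell can exceed 1 if the same key combination appears more than once in the data, which is why the adapted version adds `time` and `day` to the formula.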

You should break your goal down into smaller, doable problems.

  1. Create a matrix of 0's
  2. Fill the diagonal with 1's
  3. Add the matrix to your original data

# 0. Create data
df <- mtcars[1:6, 1:4]
df
#                    mpg cyl disp  hp
# Mazda RX4         21.0   6  160 110
# Mazda RX4 Wag     21.0   6  160 110
# Datsun 710        22.8   4  108  93
# Hornet 4 Drive    21.4   6  258 110
# Hornet Sportabout 18.7   8  360 175
# Valiant           18.1   6  225 105

# 1. Create matrix of 0's
foo <- matrix(0, nrow(df), nrow(df))

# 2. Fill diagonal
diag(foo) <- 1

# 3. Combine with original data
cbind(df, foo)
#                    mpg cyl disp  hp 1 2 3 4 5 6
# Mazda RX4         21.0   6  160 110 1 0 0 0 0 0
# Mazda RX4 Wag     21.0   6  160 110 0 1 0 0 0 0
# Datsun 710        22.8   4  108  93 0 0 1 0 0 0
# Hornet 4 Drive    21.4   6  258 110 0 0 0 1 0 0
# Hornet Sportabout 18.7   8  360 175 0 0 0 0 1 0
# Valiant           18.1   6  225 105 0 0 0 0 0 1
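
The diagonal trick matches the question's example only because every row there has its own hour, in ascending order. If hours repeat or arrive unordered, a small variation on the same "build a 0/1 matrix, then bind it" idea could look like the sketch below. The `hour` vector is made-up illustration data, and `lvls` and `ind` are names chosen here.

```r
# Hypothetical input: hours repeat and are not sorted.
hour <- c(3, 1, 3, 2)

# Distinct levels, one indicator column each.
lvls <- sort(unique(hour))

# Compare every value with every level; multiplying by 1L turns TRUE/FALSE into 1/0.
ind <- outer(hour, lvls, FUN = "==") * 1L
colnames(ind) <- lvls

ind
```

Each row of `ind` has exactly one 1, in the column of that row's hour, so `cbind(df, ind)` works as in step 3 above.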

A base R solution is as follows:

dat <- data.frame(time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616), 
hour = c(1, 2, 3, 4, 5, 6), 
Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32), 
day = c(5, 5, 5, 5, 5, 5) )

dat$dummy_day <- factor(dat$day, levels = 1:7)

model.matrix(~time + hour + Money + day + dummy_day, dat, 
             contrasts = list(dummy_day = "contr.SAS"))

It returns a matrix:

  (Intercept)     time hour Money day dummy_day1 dummy_day2 dummy_day3 dummy_day4 dummy_day5 dummy_day6
1           1 20000616    1  9.35   5          0          0          0          0          1          0
2           1 20000616    2  6.22   5          0          0          0          0          1          0
3           1 20000616    3 10.65   5          0          0          0          0          1          0
4           1 20000616    4 11.42   5          0          0          0          0          1          0
5           1 20000616    5 10.12   5          0          0          0          0          1          0
6           1 20000616    6  7.32   5          0          0          0          0          1          0
attr(,"assign")
 [1] 0 1 2 3 4 5 5 5 5 5 5
attr(,"contrasts")
attr(,"contrasts")$dummy_day
[1] "contr.SAS"
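
The example above dummifies day. For the hour column the question asks about, the same `model.matrix` idea might be applied as in this sketch. Dropping the intercept with `- 1` keeps an indicator for every level of the factor. The names `hour_ind` and `res` are illustrative only.

```r
dat <- data.frame(
  time  = rep(20000616, 6),
  hour  = c(1, 2, 3, 4, 5, 6),
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32),
  day   = rep(5, 6)
)

# No intercept, so all six levels of hour get their own 0/1 column.
hour_ind <- model.matrix(~ factor(hour) - 1, data = dat)

# Rename the columns from "factor(hour)1" etc. to plain level names.
colnames(hour_ind) <- levels(factor(dat$hour))

# Bind the indicators to the remaining columns.
res <- cbind(dat[c("time", "Money", "day")], hour_ind)
res
```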

You can easily create dummy variables with the dummies package.

library(dummies)

df <- data.frame(
  time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616), 
  hour = c(1, 2, 3, 4, 5, 6), 
  Money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32), 
  day = c(5, 5, 5, 5, 5, 5))

# Specify the categorical variables in the dummy.data.frame function.
df_dummy <- dummy.data.frame(df, names=c("hour"), sep="_")
names(df_dummy) <- c("time", 1:6, "Money", "day")
df_dummy <- df_dummy[c("time", "Money", "day", 1:6)]
df_dummy
#       time Money day 1 2 3 4 5 6
# 1 20000616  9.35   5 1 0 0 0 0 0
# 2 20000616  6.22   5 0 1 0 0 0 0
# 3 20000616 10.65   5 0 0 1 0 0 0
# 4 20000616 11.42   5 0 0 0 1 0 0
# 5 20000616 10.12   5 0 0 0 0 1 0
# 6 20000616  7.32   5 0 0 0 0 0 1

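Some others have mentioned using model.matrix to get a design matrix. That is a good solution, but I find that I usually want to customize how missing values are handled or how rare levels are collapsed. So here is an alternative function that you can customize.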

```

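    library(data.table)

    # One-hot encode the given columns of a data.table.
    # Note: DT is modified by reference (new columns are added via `:=`).
    one_hot_encode <- function(DT, cols_to_encode, include_last = TRUE
                               , protected_NA_val = 'NA_MISSING'
    ) {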
        for (col in cols_to_encode) {
            level_freq <- DT[, sort(table(get(col), useNA = 'ifany')
                                    , decreasing = TRUE)]
            level_names <- names(level_freq)
            level_names[is.na(level_names)] <- protected_NA_val
            if (!include_last) {
                level_names <- level_names[-length(level_names)]
            }
            for (lev in level_names) {
                new_col_name <- paste('ONE_HOT', col, lev, sep = '_')
                DT[, (new_col_name) := 0]
                if (lev == protected_NA_val) {
                    DT[is.na(get(col)), (new_col_name) := 1]
                } else {
                    DT[get(col) == lev, (new_col_name) := 1]
                }
            }
        }
        return(DT)
    }

```

Applying this function to your dataset then becomes:

```

    DT <- data.table(
        time = c(20000616, 20000616, 20000616, 20000616, 20000616, 20000616)
        , hour = c(1, 2, 3, 4, 5, 6)
        , money = c(9.35, 6.22, 10.65, 11.42, 10.12, 7.32)
        , day = c(5, 5, 5, 5, 5, 5)
    )
    DT <- one_hot_encode(DT, 'hour')

```
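
The function above gives missing values their own column, but it does not collapse rare levels itself. One way to do that before encoding is sketched below in base R. The helper `collapse_rare` and its parameters `min_count` and `other` are hypothetical names introduced for this illustration, and the threshold is arbitrary.

```r
# Hypothetical helper: replace levels seen fewer than `min_count` times
# with a single catch-all label. NA values are left as NA.
collapse_rare <- function(x, min_count = 2, other = "OTHER") {
  x <- as.character(x)
  counts <- table(x)                        # NA is not counted by default
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- other
  x
}

# Made-up example: "c" occurs only once, so it is collapsed.
collapsed <- collapse_rare(c("a", "a", "b", "c", "a", "b", NA))
collapsed
```

The collapsed column can then be encoded like any other, so the rare values share one indicator column instead of each getting their own.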