R: Create indicator variables from data and extract category

I have a data frame df that stores the mean height (in cm) of several thousand plants across different years:

Name    Year    Height
Plant1  2010    440
Plant2  2011    60
Plant1  2011    1980
Plant3  2013    650
Plant4  2016    210

I would like to do the following:

a) Create a variable for each 50 cm height interval between 400 cm and 2000 cm (inclusive), plus two variables for <400 and >2000. df should then look like this:

Name    Year    Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1  2010    440    
Plant2  2011    60
Plant1  2011    1980
Plant3  2013    640
Plant4  2016    210

b) Assign each variable 0 or 1 according to the actual height:

Name    Year    Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1  2010    440    0       1     0     0     0
Plant2  2011    60     1       0     0     0     0
Plant1  2011    1980   0       0     0     0     0
Plant3  2013    640    0       0     0     0     0
Plant4  2016    210    1       0     0     0     0

c) Add a variable indicating which height category each entry belongs to:

Name    Year    Height h_0_400 h_400 h_450 h_500 h_550 etc. height_index
Plant1  2010    440    0       1     0     0     0          h_400
Plant2  2011    60     1       0     0     0     0          h_0_400
Plant1  2011    1980   0       0     0     0     0          h_1950
Plant3  2013    640    0       0     0     0     0          h_600
Plant4  2016    210    1       0     0     0     0          h_0_400

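I'm not sure how to approach this and would appreciate any insight. So far I have tried using seq(400,2000,by=1) and dropping the values I don't need, but that seems very inefficient. I'm happy to use any package. Many thanks!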

One option is to create a grouping variable with cut (or findInterval) and then reshape to wide format:

library(dplyr)
library(tidyr)
library(stringr)

# lower bounds of the 50 cm intervals
brks <- seq(400, 2000, by = 50)

out <- df %>%
   # create a grouping variable with cut based on Height;
   # right = FALSE gives left-closed intervals such as [400, 450),
   # -Inf and Inf catch heights below 400 and from 2000 upwards
   mutate(ind = cut(Height, breaks = c(-Inf, brks, Inf), right = FALSE,
                    labels = c('h_0_400', str_c('h_', brks))),
          height_index = ind, n = 1) %>%
   # reshape to wide format, filling absent indicators with 0
   pivot_wider(names_from = ind, values_from = n, values_fill = list(n = 0))

# columns for intervals that contain no rows are found with setdiff and set to 0
out[setdiff(levels(out$height_index), out$height_index)] <- 0
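
To see how the interval boundaries behave, here is a minimal sketch that applies cut to a few hand-picked heights. The vector `h` and the names `brks` and `labs` are illustrative only, and the sketch assumes left-closed intervals (`right = FALSE`) so that a height of exactly 400 lands in h_400, as the question's expected output suggests.

```r
# Illustrative heights around the interval boundaries (not the question's data)
h <- c(60, 399, 400, 449, 450, 1980, 2000, 2500)

# Break points: everything below 400, then 50 cm steps, then 2000 and above
brks <- c(-Inf, seq(400, 2000, by = 50), Inf)
labs <- c("h_0_400", paste0("h_", seq(400, 2000, by = 50)))

# right = FALSE makes the intervals left-closed: [400, 450), [450, 500), ...
bins <- as.character(cut(h, breaks = brks, labels = labs, right = FALSE))

print(data.frame(h, bins))
```

With the default `right = TRUE` the intervals would instead be right-closed, so 400 would be grouped with the values below it.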

Data

df <- structure(list(Name = c("Plant1", "Plant2", "Plant1", "Plant3", 
"Plant4"), Year = c(2010L, 2011L, 2011L, 2013L, 2016L), Height = c(440L,
60L, 1980L, 650L, 210L)), class = "data.frame", row.names = c(NA, 
-5L))

Here is one way to do this in base R:

# Create the sequence of interval boundaries
vals <- seq(400, 2000, 50)
# Create column names such as h_0_400, h_400_450, ...
cols <- paste('h', c(0, vals[-length(vals)]), vals, sep = "_")
# Initialize the new columns with 0
df[cols] <- 0
# Find the interval in which each height lies
inds <- findInterval(df$Height, vals) + 1
# Set the matching column to 1 for every row
df[cols][cbind(seq_len(nrow(df)), inds)] <- 1
# Create a new column holding the matching column name
df$height_index <- cols[inds]

The final data frame looks like this:

df
#    Name Year Height h_0_400 h_400_450 h_450_500 h_500_550 h_550_600
#1 Plant1 2010    440       0         1         0         0         0
#2 Plant2 2011     60       1         0         0         0         0
#3 Plant1 2011   1980       0         0         0         0         0
#4 Plant3 2013    650       0         0         0         0         0
#5 Plant4 2016    210       1         0         0         0         0
#  h_600_650 h_650_700 h_700_750 h_750_800 h_800_850 h_850_900 h_900_950
#1         0         0         0         0         0         0         0
#2         0         0         0         0         0         0         0
#3         0         0         0         0         0         0         0
#4         0         1         0         0         0         0         0
#5         0         0         0         0         0         0         0
#  h_950_1000 h_1000_1050 h_1050_1100 h_1100_1150 h_1150_1200 h_1200_1250
#1          0           0           0           0           0           0
#2          0           0           0           0           0           0
#3          0           0           0           0           0           0
#4          0           0           0           0           0           0
#5          0           0           0           0           0           0
#  h_1250_1300 h_1300_1350 h_1350_1400 h_1400_1450 h_1450_1500 h_1500_1550
#1           0           0           0           0           0           0
#2           0           0           0           0           0           0
#3           0           0           0           0           0           0
#4           0           0           0           0           0           0
#5           0           0           0           0           0           0
#  h_1550_1600 h_1600_1650 h_1650_1700 h_1700_1750 h_1750_1800 h_1800_1850
#1           0           0           0           0           0           0
#2           0           0           0           0           0           0
#3           0           0           0           0           0           0
#4           0           0           0           0           0           0
#5           0           0           0           0           0           0
#  h_1850_1900 h_1900_1950 h_1950_2000 height_index
#1           0           0           0    h_400_450
#2           0           0           0      h_0_400
#3           0           0           1  h_1950_2000
#4           0           0           0    h_650_700
#5           0           0           0      h_0_400
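
The base R code above has no column for heights of 2000 cm or more: for such a value findInterval returns length(vals), so inds would point one position past the last column. Below is a minimal sketch of one way to cover that case. It assumes an extra overflow column whose name, `h_2000_plus`, is made up for illustration; the sample data `d` is hypothetical as well.

```r
# Hypothetical sample data, including one height above 2000 cm
d <- data.frame(Name = c("Plant1", "Plant2", "Plant5"),
                Height = c(440, 60, 2300))

vals <- seq(400, 2000, 50)
# Same interval columns as above, plus an assumed overflow column for >= 2000
cols <- c(paste("h", c(0, vals[-length(vals)]), vals, sep = "_"), "h_2000_plus")

d[cols] <- 0
# findInterval() returns 0 below 400 and length(vals) from 2000 upwards,
# so adding 1 yields a valid position in cols for every height
inds <- findInterval(d$Height, vals) + 1
d[cols][cbind(seq_len(nrow(d)), inds)] <- 1
d$height_index <- cols[inds]

# Sanity check: every row has exactly one indicator set
stopifnot(all(rowSums(d[cols]) == 1))

print(d[c("Name", "Height", "height_index")])
```

The rowSums check is a cheap way to confirm that the indicator columns are mutually exclusive and that no height fell outside the defined intervals.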