R: Create indicator variables from data and extract category
I have a data frame df that stores the average height (in cm) of a few thousand plants in different years:
Name Year Height
Plant1 2010 440
Plant2 2011 60
Plant1 2011 1980
Plant3 2013 650
Plant4 2016 210
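I would like to do the following:
a) Create a variable for each 50 cm height interval between 400 cm and 2000 cm (inclusive), plus two variables for <400 and >2000. df should then look like this: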
Name Year Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1 2010 440
Plant2 2011 60
Plant1 2011 1980
Plant3 2013 650
Plant4 2016 210
b) Assign each variable 0 or 1 according to the actual Height:
Name Year Height h_0_400 h_400 h_450 h_500 h_550 etc.
Plant1 2010 440 0 1 0 0 0
Plant2 2011 60 1 0 0 0 0
Plant1 2011 1980 0 0 0 0 0
Plant3 2013 650 0 0 0 0 0
Plant4 2016 210 1 0 0 0 0
c) Add a variable indicating which height category the entry belongs to:
Name Year Height h_0_400 h_400 h_450 h_500 h_550 etc. height_index
Plant1 2010 440 0 1 0 0 0 h_400
Plant2 2011 60 1 0 0 0 0 h_0_400
Plant1 2011 1980 0 0 0 0 0 h_1950
Plant3 2013 650 0 0 0 0 0 h_650
Plant4 2016 210 1 0 0 0 0 h_0_400
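I'm not sure how to approach this and would appreciate any insight. So far I have tried using seq(400,2000,by=1) and removing the unwanted values, but that seems very inefficient.
I'm happy to use any package. Many thanks!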
One option is to use cut (or findInterval) to create a grouping variable and then reshape to wide format:
library(dplyr)
library(tidyr)
library(stringr)
out <- df %>%
  # create a grouping variable with cut based on Height;
  # right = FALSE gives the bins [400, 450), [450, 500), ..., [2000, Inf)
  mutate(ind = cut(Height, breaks = c(-Inf, seq(400, 2000, by = 50), Inf),
                   labels = c('h_0_400', str_c('h_', seq(400, 2000, by = 50))),
                   right = FALSE),
         height_index = ind, n = 1) %>%
  # reshape to wide format
  pivot_wider(names_from = ind, values_from = n, values_fill = list(n = 0))
# columns for bins with no observations are added with setdiff and set to 0
out[setdiff(levels(out$height_index), as.character(out$height_index))] <- 0
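To see how the interval boundaries map to labels, here is a minimal sketch that applies cut to the five sample heights on their own. It assumes the bins are closed on the left (right = FALSE), so that 400 falls into h_400, and that everything at or above 2000 goes into a single h_2000 bin. Both are interpretations of the question, not something it states explicitly. The names heights, lower, labs and bins are illustrative only.

```r
# Sample heights from the question
heights <- c(440, 60, 1980, 650, 210)

# Lower bounds of the 50 cm bins, from 400 to 2000
lower <- seq(400, 2000, by = 50)

# One label for everything below 400, then one label per lower bound
labs <- c("h_0_400", paste0("h_", lower))

# right = FALSE makes each bin [lower, upper), so 400 belongs to h_400
bins <- cut(heights, breaks = c(-Inf, lower, Inf), labels = labs, right = FALSE)

as.character(bins)
# [1] "h_400"   "h_0_400" "h_1950"  "h_650"   "h_0_400"
```

With the default right = TRUE the bins would instead be (lower, upper], and a height of exactly 400 would land in the bin below.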
Data
df <- structure(list(Name = c("Plant1", "Plant2", "Plant1", "Plant3",
"Plant4"), Year = c(2010L, 2011L, 2011L, 2013L, 2016L), Height = c(440L,
60L, 1980L, 650L, 210L)), class = "data.frame", row.names = c(NA,
-5L))
Here is one way to do this in base R:
# Create the sequence of break points
vals <- seq(400, 2000, 50)
# Create the column names
cols <- paste('h', c(0, vals[-length(vals)]), vals, sep = "_")
# Initialize the new columns with 0
df[cols] <- 0
# Find the interval each height falls in (assumes every height is below 2000)
inds <- findInterval(df$Height, vals) + 1
# Set the matching column to 1 for each row
df[cols][cbind(seq_len(nrow(df)), inds)] <- 1
# Add a column giving the name of the matching column
df$height_index <- cols[inds]
The final data frame looks like this:
df
# Name Year Height h_0_400 h_400_450 h_450_500 h_500_550 h_550_600
#1 Plant1 2010 440 0 1 0 0 0
#2 Plant2 2011 60 1 0 0 0 0
#3 Plant1 2011 1980 0 0 0 0 0
#4 Plant3 2013 650 0 0 0 0 0
#5 Plant4 2016 210 1 0 0 0 0
# h_600_650 h_650_700 h_700_750 h_750_800 h_800_850 h_850_900 h_900_950
#1 0 0 0 0 0 0 0
#2 0 0 0 0 0 0 0
#3 0 0 0 0 0 0 0
#4 0 1 0 0 0 0 0
#5 0 0 0 0 0 0 0
# h_950_1000 h_1000_1050 h_1050_1100 h_1100_1150 h_1150_1200 h_1200_1250
#1 0 0 0 0 0 0
#2 0 0 0 0 0 0
#3 0 0 0 0 0 0
#4 0 0 0 0 0 0
#5 0 0 0 0 0 0
# h_1250_1300 h_1300_1350 h_1350_1400 h_1400_1450 h_1450_1500 h_1500_1550
#1 0 0 0 0 0 0
#2 0 0 0 0 0 0
#3 0 0 0 0 0 0
#4 0 0 0 0 0 0
#5 0 0 0 0 0 0
# h_1550_1600 h_1600_1650 h_1650_1700 h_1700_1750 h_1750_1800 h_1800_1850
#1 0 0 0 0 0 0
#2 0 0 0 0 0 0
#3 0 0 0 0 0 0
#4 0 0 0 0 0 0
#5 0 0 0 0 0 0
# h_1850_1900 h_1900_1950 h_1950_2000 height_index
#1 0 0 0 h_400_450
#2 0 0 0 h_0_400
#3 0 0 1 h_1950_2000
#4 0 0 0 h_650_700
#5 0 0 0 h_0_400
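The base R approach above has no column for heights of 2000 cm or more, so findInterval would return an index past the last column for such a row. Below is a minimal sketch of one way to add that overflow bin. The column name h_2000_Inf and the small data frame demo are made up for illustration and are not part of the original answer.

```r
# Hypothetical data, including one plant taller than 2000 cm
demo <- data.frame(Name = c("PlantA", "PlantB", "PlantC"),
                   Height = c(60, 1980, 2300))

vals <- seq(400, 2000, 50)

# Same names as before, plus an assumed overflow column for heights >= 2000
cols <- c(paste("h", c(0, vals[-length(vals)]), vals, sep = "_"), "h_2000_Inf")

# 0 break points passed -> column 1, ..., all 33 passed -> column 34
inds <- findInterval(demo$Height, vals) + 1

# Build the indicators as a matrix, then set one cell per row to 1
ind_mat <- matrix(0, nrow = nrow(demo), ncol = length(cols),
                  dimnames = list(NULL, cols))
ind_mat[cbind(seq_len(nrow(demo)), inds)] <- 1

demo <- cbind(demo, ind_mat)
demo$height_index <- cols[inds]

demo$height_index
# [1] "h_0_400"     "h_1950_2000" "h_2000_Inf"
```

Filling a separate matrix and binding it on afterwards keeps the two-column (row, column) indexing independent of where the indicator columns sit in the data frame.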