Subsetting observations by factor levels with more than x observations
I have a data set in which one factor has a large number of levels (roughly 140), which (I think) is why the lm function fails:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
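What I would like to do is subset inside the lm call so that it only uses factor levels with more than x observations.
For example, this data.table has a factor (some_NA_factor) in which levels 1, 2, 4 and 5 have 17 observations each and level 3 has 16. I would like to subset the data directly (inside the lm call) so that only observations belonging to factor levels with more than 16 (at least 17) observations are used: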
set.seed(1)
library(data.table)
DT <- data.table(panelID = sample(50,50), # Creates a panel ID
                 Country = c(rep("A",30),rep("B",50), rep("C",20)),
                 some_NA = sample(0:5, 6),
                 some_NA_factor = sample(0:5, 6),
                 Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                 Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
                 norm = round(runif(100)/10,2),
                 Income = sample(100,100),
                 Happiness = sample(10,10),
                 Sex = round(rnorm(10,0.75,0.3),2),
                 Age = round(rnorm(10,0.75,0.3),2),
                 Educ = round(rnorm(10,0.75,0.3),2))
DT[, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA # Recodes zeros as missing values
DT$some_NA_factor <- factor(DT$some_NA_factor)
table(DT$some_NA_factor)
The usual subset syntax in lm looks like this, for example:
lm(Happiness ~ Income + some_NA_factor, data=DT, subset=(Income > 50 & Happiness < 5))
How can I adapt this syntax to check the number of observations per factor level?
Consider building a boolean vector with Filter and isTRUE on the result of a table call, then pass its names to %in% inside the subset argument:
boolean_vec <- Filter(isTRUE, table(DT$some_NA_factor) > 16)
boolean_vec
# 1 2 4 5
# TRUE TRUE TRUE TRUE
lm(Happiness ~ Income + some_NA_factor, data=DT,
subset=(Income > 50 & Happiness < 5 & some_NA_factor %in% names(boolean_vec)))
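The same idea can be wrapped in a small helper so that the threshold x becomes a parameter. This is only a sketch in base R on toy data: the data frame df, its columns and the helper name levels_over are made up for illustration and are not part of the question's data set.

```r
# Toy data (hypothetical): factor f has 5 x "a", 3 x "b", 1 x "c" and one NA.
df <- data.frame(
  y = c(1.2, 2.3, 2.9, 4.1, 5.2, 2.0, 3.1, 3.9, 7.0, 1.0),
  x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  f = factor(c("a", "a", "a", "a", "a", "b", "b", "b", "c", NA))
)

# Hypothetical helper: names of the levels with more than `min_n` observations.
# table() ignores NA by default, so missing values are never counted as a level.
levels_over <- function(f, min_n) {
  counts <- table(f)
  names(counts)[as.vector(counts) > min_n]
}

keep <- levels_over(df$f, 2)
keep        # "a" "b"

# NA %in% keep is FALSE, so rows with a missing factor value are excluded too.
# lm() drops unused factor levels itself, so no droplevels() call is needed.
fit <- lm(y ~ x + f, data = df, subset = f %in% keep)
nobs(fit)   # 8
```

With the question's data, the equivalent call would be levels_over(DT$some_NA_factor, 16), combined with the other conditions in subset using &.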
Alternatively, use the %>% pipe operator available through dplyr, so that you don't have to store each subset separately:
library(dplyr)
DT %>%
  filter(!is.na(some_NA_factor)) %>%
  count(some_NA_factor) %>%
  filter(n > 16) %>%
  inner_join(DT, by = 'some_NA_factor') %>%
  lm(Happiness ~ Income + some_NA_factor, data = .)
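If the contrasts error still shows up after filtering, it may be worth checking how many levels each factor keeps in the rows lm actually uses. As far as I can tell, the message is raised when a factor is left with fewer than two levels once rows with missing values are dropped, rather than by a large number of levels. A minimal sketch on toy data (the data frame df and its columns are made up for illustration):

```r
# Toy data (hypothetical): only level "a" of f has a non-missing response.
df <- data.frame(
  y = c(1.1, 2.0, 3.2, 3.9, NA, NA),
  x = c(1, 2, 3, 4, 5, 6),
  f = factor(c("a", "a", "a", "a", "b", "b"))
)

# Keep the complete rows (the ones lm() would use with the default na.action),
# drop the unused levels, then count the levels that remain.
used <- droplevels(df[complete.cases(df[, c("y", "x", "f")]), ])
nlevels(used$f)   # 1

# Fitting fails because f is left with a single level; capture the message
# instead of stopping the script.
res <- tryCatch(
  lm(y ~ x + f, data = df),
  error = function(e) conditionMessage(e)
)
res
```

A factor for which nlevels() returns 1 here would need to be removed from the formula, or the subset widened, before lm can fit the model.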