Weed out variables with fewer than two factor levels

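The variables in my data frame hold character observations (not sure that's the right way to put it; basically, when I pull up the structure, the data are listed as "chr").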

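I want to convert everything to factors first and then check the number of factor levels. Once they are factors, I only want to keep working with the variables in the data frame that have two or more levels.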

Here is my idea so far. I know for loops are something of a taboo in R, but I'm new, and using one makes sense to me.

x = as.character(c("Not Sampled", "Not Sampled", "Y", "N"))
y = as.character(c("Not Sampled", "Not Sampled", "Not Sampled", "Not Sampled"))
z = as.character(c("Y", "N", "Not Sampled", "Y"))
df = data.frame(x, y, z)

for i in df:
  df$Response = as.factor(df[,i]) #create new variable in dataframe
  df$Response = df@data[sapply ....  #where I think I can separate out the variables I want and the variables I don't want

  m1 = lm(response ~ 1) #next part where I want only the selected variables

I know the solution is probably much more complicated, but this is my fledgling attempt.

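library(dplyr)

# Convert every column to a factor, then rebuild the data frame
df <- df %>% lapply(factor) %>% data.frame()
# Keep the columns with at least two distinct values;
# drop = FALSE keeps a data frame even if only one column remains
df[, sapply(df, n_distinct) >= 2, drop = FALSE]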

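In R versions before 4.0.0, data.frame converts strings to factors by default, so no extra conversion is needed in this case. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so pass stringsAsFactors = TRUE or convert the columns explicitly. lapply is the better fit for comparing levels, because sapply will try to simplify the return value to a matrix when the results all have the same length.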
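The simplification behaviour described above can be seen in a small sketch. It assumes the same three example columns from the question and converts them explicitly with stringsAsFactors = TRUE, so it should not depend on the R version. The names demo and lv_list are only for illustration.

```r
# Hypothetical demo data, built with explicit factor conversion
demo <- data.frame(
  x = c("Not Sampled", "Not Sampled", "Y", "N"),
  y = c("Not Sampled", "Not Sampled", "Not Sampled", "Not Sampled"),
  z = c("Y", "N", "Not Sampled", "Y"),
  stringsAsFactors = TRUE
)

# lapply always returns a list, one element per column
lv_list <- lapply(demo, levels)
is.list(lv_list)

# sapply also returns a list here, because the level counts differ (3, 1, 3)
is.list(sapply(demo, levels))

# With only x and z, every result has length 3, so sapply
# simplifies the result to a character matrix instead
is.matrix(sapply(demo[c("x", "z")], levels))
```

Because lengths() needs a list with one element per column, lapply gives a predictable result whichever columns are present.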

df = data.frame(x, y, z, stringsAsFactors = TRUE)

## Already factors; use sapply(df, typeof) to see the underlying representation
sapply(df, class)
#        x        y        z 
# "factor" "factor" "factor" 

## These are the columns with >= 2 levels
lengths(lapply(df, levels)) >= 2
#    x     y     z 
# TRUE FALSE  TRUE 

## Extract only those columns
df[lengths(lapply(df, levels)) >= 2]
## Same selection with sapply
df[, sapply(df, function(x) length(levels(x)) >= 2)]
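For reuse, the two steps (convert to factors, keep columns with enough levels) could be wrapped in one function. This is only a sketch in base R: keep_multilevel and its min_levels argument are hypothetical names, not part of base R or dplyr, and the example assumes the columns start out as character vectors.

```r
# keep_multilevel() is a hypothetical helper name used for illustration
keep_multilevel <- function(data, min_levels = 2) {
  # Convert every column to a factor; data[] keeps the data frame structure
  data[] <- lapply(data, factor)
  # Count the levels of each column
  n_levels <- lengths(lapply(data, levels))
  # drop = FALSE keeps a data frame even if only one column survives
  data[, n_levels >= min_levels, drop = FALSE]
}

# Example data from the question, kept as character columns
df <- data.frame(
  x = c("Not Sampled", "Not Sampled", "Y", "N"),
  y = c("Not Sampled", "Not Sampled", "Not Sampled", "Not Sampled"),
  z = c("Y", "N", "Not Sampled", "Y"),
  stringsAsFactors = FALSE
)

kept <- keep_multilevel(df)
names(kept)
```

Note that factor() does not treat NA as a level by default, so a column holding one value plus missing values would count as having a single level here.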