What is the best way to analyse checkbox data (i.e. column counts) in R when each choice is its own column and unchosen choices are NA?

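Qualtrics codes the survey output for questions where multiple responses can be recorded, such as race/ethnicity demographics (example below), in a way that I have not been able to analyse easily. Each option gets its own column; for every row, a ticked checkbox is recorded under its option and unticked options are left blank. I decided a good starting point would be to count the non-NA values per choice. However, it has not gone as planned, and a thorough search for existing solutions turned up nothing useful. I found a way to get the column counts with apply, but the output is still clunky to work with. I have a data frame with many columns that need to be analysed this way, so I am using grep to select the relevant columns where choice counts are needed.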

Data:

data <- structure(list(race_White = c("White", NA, NA, "White", NA, NA, 
"White", "White", NA, "White", "White", "White", "White", "White", 
"White", "White", "White", "White", "White", "White", NA, "White", 
"White", "White", NA), `race_Black or African American` = c(NA, 
NA, "Black or African American", NA, NA, "Black or African American", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, "Black or African American"), `race_American Indian or Alaska Native` = c(NA, 
NA, NA, NA, NA, NA, NA, NA, "American Indian or Alaska Native", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), race_Asian = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "Asian", NA, NA, NA, NA), 
    `race_Middle Eastern or North African` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), `race_Hispanic, Latino or Spanish` = c(NA, "Hispanic, Latino or Spanish", 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), `race_Native Hawaiian or Pacific Islander` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_
    ), `race_ Prefer not to share` = c(NA, NA, NA, NA, "Prefer not to share", 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA), race_Other = c(NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), education_level = structure(c(3L, 
    2L, 5L, 4L, 6L, 3L, 6L, 2L, 3L, 3L, 5L, 2L, 5L, 5L, 3L, 3L, 
    5L, 2L, 5L, 5L, 5L, 3L, 3L, 3L, 5L), .Label = c("Less than high school degree", 
    "High school graduate (high school diploma or equivalent)", 
    "Some college but no degree", "Associate's degree (2-year)", 
    "Bachelor's degree (4-year)", "Master's degree", "Doctoral/Professional degree (PhD, MD, JD)", 
    "Other/Prefer not to share"), class = "factor"), age = c(74, 
    43, NA, 37, 61, 64, NA, NA, 45, NA, NA, 21, NA, NA, 52, 43, 
    43, NA, 65, 42, NA, 27, 35, NA, 46)), row.names = c(NA, -25L
), class = c("tbl_df", "tbl", "data.frame"))

I used grep to select the numbers of the columns whose options I want to count, with the following command:

race <- grep("race", colnames(data))

Then I also stored the column names, in case a function needs names rather than numbers:

racenames <- colnames(data[race])

After creating these selections, I tried to get some kind of table of counts of the rows not equal to "" using the following (without success):

racecounts <- sapply(data[race],FUN = function(x){length(x[x!=""])})
racecounts

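This basically counted every row in each column rather than only the non-empty rows I was hoping for. So I tried a simple apply call instead, which did work: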

racecounts2 <- apply(data[race], 2, table)
racecounts2

That works, and I then have to convert it with prop.table to get proportions for use with kable:
racecounts2 <- prop.table(racecounts2)
racecounts2 %>%
  kbl() %>%
  kable_material_dark()

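Has anyone found an alternative or better way of handling this data format? I am willing to try anything different; my approach feels clunky and its output is hard to visualise. It would be great to find a way of handling this data that makes ranking, plotting and so on easier going forward.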

So I am curious what the community would do.

You can count the number of non-NA values with !is.na, like this:

colSums(!is.na(data[race]))
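
As a minimal sketch of why this works where the `x != ""` attempt in the question did not: comparing NA with anything yields NA rather than FALSE, and subsetting by an NA index keeps an NA element, so the length never shrinks. The vector `x` and the data frame `toy` below are made-up illustrations, not the survey data. `colMeans` is shown as one possible shortcut to per-option proportions.

```r
# Toy column shaped like the survey data: a label when ticked, NA otherwise.
x <- c("White", NA, NA, "White")

x != ""               # TRUE NA NA TRUE: comparing NA gives NA, not FALSE
length(x[x != ""])    # 4: NA indices are kept as NA elements, so nothing is dropped
sum(!is.na(x))        # 2: is.na() never returns NA, so this counts ticked boxes

# Hypothetical two-column data frame (toy values) for the column-wise versions.
toy <- data.frame(race_White = c("White", NA, NA, "White"),
                  race_Asian = c(NA, "Asian", NA, NA))
counts <- colSums(!is.na(toy))   # number of respondents ticking each option
props  <- colMeans(!is.na(toy))  # share of respondents ticking each option
```

Because respondents can tick several boxes, the shares from `colMeans` need not sum to 1, which differs from what `prop.table` on the counts would give.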

Or, using dplyr syntax and tidyr::pivot_longer to get something that looks more like a table:

library(dplyr)
library(tidyr)
library(stringr)

data %>% select(starts_with("race")) %>% 
  summarise(across(everything(), ~sum(!is.na(.x)))) %>% 
  pivot_longer(cols=everything(), names_to = "race", values_to = "count",
               names_transform = list(race = \(x) str_remove(x, "race_")))

# A tibble: 9 x 2
  race                                  count
  <chr>                                 <int>
1 "White"                                  18
2 "Black or African American"               3
3 "American Indian or Alaska Native"        1
4 "Asian"                                   1
5 "Middle Eastern or North African"         0
6 "Hispanic, Latino or Spanish"             1
7 "Native Hawaiian or Pacific Islander"     0
8 " Prefer not to share"                    1
9 "Other"                                   0
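
Since the question also mentions ranking and proportions, here is one way the long table could be taken a step further. This is only a sketch under assumptions: `toy` is a small made-up tibble standing in for the survey data, `names_prefix` is used in place of `names_transform` to strip the `race_` prefix, and the proportion is taken relative to the number of respondents (rows), which is one choice among several.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the survey data: a label when ticked, NA otherwise.
toy <- tibble(
  race_White = c("White", NA, "White", "White"),
  race_Asian = c(NA, "Asian", NA, "Asian"),
  race_Other = c(NA_character_, NA, NA, NA)
)

ranked <- toy %>%
  # Count ticked boxes per option.
  summarise(across(starts_with("race_"), ~ sum(!is.na(.x)))) %>%
  # One row per option, dropping the "race_" prefix from the names.
  pivot_longer(everything(), names_to = "race", values_to = "count",
               names_prefix = "race_") %>%
  # Share of respondents who ticked each option, then rank by count.
  mutate(prop = count / nrow(toy)) %>%
  arrange(desc(count))

ranked
```

A table in this shape, with one row per option, can be passed on to kable or to a plotting function without further reshaping.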