Subset data based on variable prefix

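I have a large dataset in which the answers to a single question are spread across several columns. Columns that belong together share the same prefix. How can I create one sub-dataset per question, grouped by that prefix?

Here is an example dataset. I am looking for an efficient, easily adaptable solution that creates datasets containing only the values for question one, two, or three.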

structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8), Question1a = c(1, 
1, NA, NA, 1, 1, 1, NA), Question1b = c(NA, 1, NA, 1, NA, 1, 
NA, 1), Question1c = c(1, 1, NA, NA, 1, NA, NA, NA), Question2a = c(1, 
NA, NA, NA, 1, 1, NA, NA), Question2b = c(NA, 1, NA, 1, NA, NA, 
NA, NA), Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA), Question3b = c(NA, 
NA, 1, 1, NA, NA, NA, NA)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L))

I think the underlying question is really about data formats. Here are a few options:

library(tidyverse)
structure(
  list(
    ID = c(1, 2, 3, 4, 5, 6, 7, 8),
    Question1a = c(1,
                   1, NA, NA, 1, 1, 1, NA),
    Question1b = c(NA, 1, NA, 1, NA, 1,
                   NA, 1),
    Question1c = c(1, 1, NA, NA, 1, NA, NA, NA),
    Question2a = c(1,
                   NA, NA, NA, 1, 1, NA, NA),
    Question2b = c(NA, 1, NA, 1, NA, NA,
                   NA, NA),
    Question3a = c(NA, NA, NA, NA, 1, 1, 1, NA),
    Question3b = c(NA,
                   NA, 1, 1, NA, NA, NA, NA)
  ),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = c(NA, -8L)
) -> square_df

square_df %>% 
  pivot_longer(-ID, 
               names_to = c("Question", "Item"),
               names_pattern = "Question(\\d+)(\\w+)") ->
  long_df
long_df
#> # A tibble: 56 × 4
#>       ID Question Item  value
#>    <dbl> <chr>    <chr> <dbl>
#>  1     1 1        a         1
#>  2     1 1        b        NA
#>  3     1 1        c         1
#>  4     1 2        a         1
#>  5     1 2        b        NA
#>  6     1 3        a        NA
#>  7     1 3        b        NA
#>  8     2 1        a         1
#>  9     2 1        b         1
#> 10     2 1        c         1
#> # … with 46 more rows

long_df %>% 
  drop_na(value) ->
  sparse_long_df
sparse_long_df
#> # A tibble: 22 × 4
#>       ID Question Item  value
#>    <dbl> <chr>    <chr> <dbl>
#>  1     1 1        a         1
#>  2     1 1        c         1
#>  3     1 2        a         1
#>  4     2 1        a         1
#>  5     2 1        b         1
#>  6     2 1        c         1
#>  7     2 2        b         1
#>  8     3 3        b         1
#>  9     4 1        b         1
#> 10     4 2        b         1
#> # … with 12 more rows

sparse_long_df %>% 
  nest(data = c(ID, Item, value)) ->
  nested_long_df
nested_long_df
#> # A tibble: 3 × 2
#>   Question data             
#>   <chr>    <list>           
#> 1 1        <tibble [12 × 3]>
#> 2 2        <tibble [5 × 3]> 
#> 3 3        <tibble [5 × 3]>

Created on 2022-05-12 by the reprex package (v2.0.1)
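
Once the data is in long format, the sub-dataset for a single question can be pulled out by splitting on the Question column. The following is a minimal base R sketch under stated assumptions: toy_long is a hypothetical, hand-built stand-in for sparse_long_df, so the block runs without any packages.

```r
# Hypothetical stand-in for sparse_long_df: one row per non-missing answer
toy_long <- data.frame(
  ID       = c(1, 1, 1, 2, 3),
  Question = c("1", "1", "2", "2", "3"),
  Item     = c("a", "c", "a", "b", "b"),
  value    = c(1, 1, 1, 1, 1)
)

# One data frame per question, named by the question number
by_question <- split(toy_long, toy_long$Question)

# Sub-dataset for question 1 only
q1 <- by_question[["1"]]
```

The same split() call should work on the tibble produced by pivot_longer, since it only relies on the Question column.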

You can use sapply with an anonymous function (df is the example dataset from the question):

list_data <- sapply(c("Question1", "Question2", "Question3"),
      function(x) df[startsWith(names(df),x)], simplify = FALSE)

This stores everything in a list. To get the individual datasets as separate objects in the global environment, use:

list2env(list_data, globalenv())

Output

# $Question1
# # A tibble: 8 × 3
#   Question1a Question1b Question1c
#        <dbl>      <dbl>      <dbl>
# 1          1         NA          1
# 2          1          1          1
# 3         NA         NA         NA
# 4         NA          1         NA
# 5          1         NA          1
# 6          1          1         NA
# 7          1         NA         NA
# 8         NA          1         NA
# 
# $Question2
# # A tibble: 8 × 2
#   Question2a Question2b
#        <dbl>      <dbl>
# 1          1         NA
# 2         NA          1
# 3         NA         NA
# 4         NA          1
# 5          1         NA
# 6          1         NA
# 7         NA         NA
# 8         NA         NA
# 
# $Question3
# # A tibble: 8 × 2
#   Question3a Question3b
#        <dbl>      <dbl>
# 1         NA         NA
# 2         NA         NA
# 3         NA          1
# 4         NA          1
# 5          1         NA
# 6          1         NA
# 7          1         NA
# 8         NA         NA
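
If typing out every prefix by hand is not practical, the prefixes can be derived from the column names themselves. This is a sketch rather than part of the answer above: it assumes that each column name ends in one or more lowercase item letters, and toy is a hypothetical miniature of the example data.

```r
# Toy data with the same naming scheme as the example (hypothetical values)
toy <- data.frame(
  ID = 1:3,
  Question1a = c(1, NA, 1),
  Question1b = c(NA, 1, 1),
  Question2a = c(1, 1, NA),
  Question3a = c(NA, NA, 1)
)

# Drop ID, then strip the trailing lowercase item letter(s) to get each prefix
answers  <- toy[-1]
prefixes <- sub("[a-z]+$", "", names(answers))

# split.default() treats the data frame as a list of columns,
# so each list element is a data frame holding one question's columns
list_auto <- split.default(answers, prefixes)
```

The result has the same shape as list_data, so list2env(list_auto, globalenv()) would work on it in the same way.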

You can also use map to store each data frame in a list, for example:

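library(dplyr)  # select() and starts_with()
library(purrr)  # map()

# 3 = number of questions
map(1:3, function(x) {
  # Build the prefix for question x, then keep ID plus the matching columns
  quest <- paste0("Question", x)
  select(df, ID, starts_with(quest))
})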

Output:

[[1]]
# A tibble: 8 x 4
     ID Question1a Question1b Question1c
  <dbl>      <dbl>      <dbl>      <dbl>
1     1          1         NA          1
2     2          1          1          1
3     3         NA         NA         NA
4     4         NA          1         NA
5     5          1         NA          1
6     6          1          1         NA
7     7          1         NA         NA
8     8         NA          1         NA

[[2]]
# A tibble: 8 x 3
     ID Question2a Question2b
  <dbl>      <dbl>      <dbl>
1     1          1         NA
2     2         NA          1
3     3         NA         NA
4     4         NA          1
5     5          1         NA
6     6          1         NA
7     7         NA         NA
8     8         NA         NA

[[3]]
# A tibble: 8 x 3
     ID Question3a Question3b
  <dbl>      <dbl>      <dbl>
1     1         NA         NA
2     2         NA         NA
3     3         NA          1
4     4         NA          1
5     5          1         NA
6     6          1         NA
7     7          1         NA
8     8         NA         NA

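I found a very intuitive solution with the dplyr package, using select together with starts_with. Alternatively, you can replace starts_with with contains if the related variables are identified not by a prefix but by some other shared feature.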

library(dplyr)

# Survey is the full dataset; each call keeps one question's columns
Q1 <- Survey %>%
  select(starts_with("Question1"))

Q2 <- Survey %>%
  select(starts_with("Question2"))

Q3 <- Survey %>%
  select(starts_with("Question3"))
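
One caveat applies to plain prefix matching, whether through starts_with or startsWith: with ten or more questions, the prefix "Question1" also matches a column such as "Question10a". The base R sketch below uses hypothetical column names to show the difference an anchored regular expression makes.

```r
# Hypothetical column names for a survey with ten or more questions
cols <- c("ID", "Question1a", "Question1b", "Question10a", "Question2a")

# A plain prefix match also picks up Question10a
prefix_cols <- cols[startsWith(cols, "Question1")]

# Requiring a lowercase letter right after the number excludes it
q1_cols <- cols[grepl("^Question1[a-z]", cols)]
```

In dplyr, the same regular expression can be passed to the matches() selection helper, for instance select(Survey, matches("^Question1[a-z]")).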