R Panel Data - Get Common Variables between Current and All Consecutive Waves

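I am currently working on a hobby project with a large panel data set (many waves, i.e. measurement time points), which can be quite challenging. To get an overview, one idea is to find the variables that the current wave has in common with each subsequent wave.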

For example:

Wave 1: Var1, Var2, Var3, Var4    
Wave 2: Var1, Var2, Var4, Var5   
Wave 3: Var1, Var5, Var6, Var7

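Here, Wave 1 has Var1, Var2 and Var4 in common with Wave 2, and only Var1 in common with Wave 3. Wave 2 has Var1 and Var5 in common with Wave 3.

Desired output

A tibble (or data.frame) with the starting wave in the rows and, in one column per subsequent wave, the variables the two waves have in common.

Starting  Wave1  Wave2             Wave3
Wave1     -      Var1, Var2, Var4  Var1
Wave2     -      -                 Var1, Var5
Wave3     -      -                 -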
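The toy example can be checked directly with base R's intersect. This is only a minimal sketch: the list name vars and the common_* objects are made up for illustration. Since Var4 is listed for both Wave 1 and Wave 2, intersect reports it for that pair as well.

```r
# Toy waves from the example, stored as a named list of variable names
vars <- list(
  Wave1 = c("Var1", "Var2", "Var3", "Var4"),
  Wave2 = c("Var1", "Var2", "Var4", "Var5"),
  Wave3 = c("Var1", "Var5", "Var6", "Var7")
)

# Variables shared by a wave and each later wave
common_1_2 <- intersect(vars$Wave1, vars$Wave2)
common_1_3 <- intersect(vars$Wave1, vars$Wave3)
common_2_3 <- intersect(vars$Wave2, vars$Wave3)

print(common_1_2)
print(common_1_3)
print(common_2_3)
```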

Simulated data:

pacman::p_load(tidyverse)
wave1 <- tibble(
  id = seq_along(1:100),
  a = runif(100, 0, 100),
  o = runif(100, 0, 100),
  x = runif(100, 0, 100),
  y = runif(100, 0, 100),
  z = runif(100, 0, 100)
)
# In wave2 some observations drop out & some new observations are added
wave2 <- tibble(
  id = seq_along(1:150),
  a = runif(150, 0, 100),
  b = runif(150, 0, 100),
  c = runif(150, 0, 100),
  d = runif(150, 0, 100),
  e = runif(150, 0, 100),
  x = runif(150, 0, 100),
  y = runif(150, 0, 100)
)
# Simulation of Dropout
wave2 %>%
  filter(!id %in% sample(1:150, 23)) -> wave2

# Same with Wave 3
wave3 <- tibble(
  id = c(wave2 %>% pull(id),151:200),
  a = runif(nrow(wave2) + 50, 0, 100),
  b = runif(nrow(wave2) + 50, 0, 100),
  c = runif(nrow(wave2) + 50, 0, 100),
  
  i = runif(nrow(wave2) + 50, 0, 100),
  j = runif(nrow(wave2) + 50, 0, 100),
  k = runif(nrow(wave2) + 50, 0, 100),
  l = runif(nrow(wave2) + 50, 0, 100),
  
  x = runif(nrow(wave2) + 50, 0, 100),
  z = runif(nrow(wave2) + 50, 0, 100)
)
# Simulation of Dropout
wave3 %>%
  filter(!id %in% sample(1:200, 33)) -> wave3

# Same with Wave 4
wave4 <- tibble(
  id = c(wave3 %>% pull(id),201:300),
  a = runif(nrow(wave3) + 100, 0, 100),
  c = runif(nrow(wave3) + 100, 0, 100),
  
  i = runif(nrow(wave3) + 100, 0, 100),
  j = runif(nrow(wave3) + 100, 0, 100),
  l = runif(nrow(wave3) + 100, 0, 100),
  
  z = runif(nrow(wave3) + 100, 0, 100)
)

# Simulation of Dropout in Wave 4
wave4 %>%
  filter(!id %in% sample(1:200, 41)) -> wave4

In the simulated data, variable a, for example, appears in all waves.
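One way to double-check which variables appear in every wave is a small presence matrix (variables in rows, waves in columns). The sketch below uses only the column names of the simulated waves, typed out by hand; wave_names, presence and in_all are illustrative names and not part of the original code.

```r
# Column names of the simulated waves (copied from the tibble definitions)
wave_names <- list(
  wave1 = c("id", "a", "o", "x", "y", "z"),
  wave2 = c("id", "a", "b", "c", "d", "e", "x", "y"),
  wave3 = c("id", "a", "b", "c", "i", "j", "k", "l", "x", "z"),
  wave4 = c("id", "a", "c", "i", "j", "l", "z")
)

# Logical matrix: TRUE where a variable is present in a wave
all_vars <- unique(unlist(wave_names))
presence <- sapply(wave_names, function(v) all_vars %in% v)
rownames(presence) <- all_vars

# Variables present in every wave
in_all <- all_vars[rowSums(presence) == ncol(presence)]
print(in_all)
```

Reduce(intersect, wave_names) should give the same set without building the matrix; the matrix is mainly useful as an overview.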

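What I have so far

Loop over the wave names with a for loop (the names are obtained via ls and a regex pattern), determine the current and the next position, fetch the corresponding data from the environment with get, and read out the colnames. Use intersect to get the column names shared by the current wave (in terms of the for loop) and the next wave. Save everything to a tibble (initialized empty).

Finally, group_by and summarise to collapse everything into one row per wave.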

##### Get Common Variables between all waves
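# Get names of tibbles (data) from the environment.
# The backslash must be doubled inside an R string; the anchors ^ and $ keep
# helper objects such as names_wave1 from matching as well.
names_waves <- ls(pattern = "^wave\\d+$") %>% str_sort(numeric = TRUE)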
waves_lagged_common2 <- tibble(wave = character(),common_vars = character())
for (wave in names_waves) {
  cur_pos <- names_waves %>% match(x = wave)
  print(cur_pos)
  if (cur_pos != length(names_waves)) {
    cur_wave_names <- get(wave) %>% names()
    next_wave_names <- get(names_waves[cur_pos + 1]) %>% names()
    intersect(cur_wave_names,next_wave_names) -> common_vars
    waves_lagged_common2 <- waves_lagged_common2 %>% add_row(wave = wave, common_vars = common_vars)
  }
}
# Merge the rows with group_by and summarise
waves_lagged_common2 %>% 
  group_by(wave) %>% 
  summarize(common_vars = paste(common_vars, collapse = ", "))

Current output:

# A tibble: 3 x 2
  wave  common_vars         
  <chr> <chr>               
1 wave1 id, a, x, y         
2 wave2 id, a, b, c, x      
3 wave3 id, a, c, i, j, l, z
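The same consecutive-wave comparison can be written without an explicit loop or get, assuming the waves are kept in a named list. The following is a base R sketch, not the original approach: make_wave is a hypothetical helper that only builds stand-in data frames with the right column names, and with the real data something like mget(names_waves) could supply the list instead.

```r
# Hypothetical helper: a one-row data frame with the given column names
make_wave <- function(cols) {
  setNames(as.data.frame(matrix(0, nrow = 1, ncol = length(cols))), cols)
}

# Stand-ins for the simulated waves (only the column names matter here)
waves <- list(
  wave1 = make_wave(c("id", "a", "o", "x", "y", "z")),
  wave2 = make_wave(c("id", "a", "b", "c", "d", "e", "x", "y")),
  wave3 = make_wave(c("id", "a", "b", "c", "i", "j", "k", "l", "x", "z")),
  wave4 = make_wave(c("id", "a", "c", "i", "j", "l", "z"))
)
wave_names <- lapply(waves, names)

# Pair each wave with its successor:
# head(-1) drops the last wave, tail(-1) drops the first
common_next <- Map(intersect, head(wave_names, -1), tail(wave_names, -1))

result <- data.frame(
  wave = names(common_next),
  common_vars = unname(vapply(common_next, paste, character(1), collapse = ", ")),
  stringsAsFactors = FALSE
)
print(result)
```

Keeping the waves in a list also avoids relying on a name pattern in the global environment.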

Further attempts to reach the desired output

Put an additional for loop inside the for loop to cover all waves that follow the current one (cur_pos), and work with add_row:
common_vars_matrix <- tibble(Wave = names_waves) %>%
  column_to_rownames("Wave")
for (wave in names_waves) {
  cur_pos <- names_waves %>% match(x = wave)
  print(cur_pos)
  if (cur_pos != length(names_waves)) {
    cur_wave_names <- get(wave) %>% names()
    next_row_to_add <- c()
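    # Parentheses are required: `:` binds tighter than `+`,
    # so cur_pos+1:n would be read as cur_pos + (1:n)
    for (next_wave in names_waves[(cur_pos + 1):length(names_waves)]) {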
      next_wave_names <- get(next_wave) %>% names()
      intersect(cur_wave_names,next_wave_names) -> common_vars
      print(next_wave)
      # print(common_vars)
      next_row_to_add <- c(next_row_to_add,common_vars)
      print(next_row_to_add)
    }
  }
}
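A common source of trouble when indexing "all later waves" is operator precedence: in R, `:` is evaluated before `+`, so a range written as cur_pos+1:length(names_waves) is not the range from cur_pos + 1 to the end. The sketch below illustrates this with a hard-coded names_waves vector (an assumption standing in for the result of ls) and also shows tail with a negative count as one possible alternative.

```r
names_waves <- c("wave1", "wave2", "wave3", "wave4")
cur_pos <- 2

# `:` is evaluated first, so this is cur_pos + (1:4), i.e. 3 4 5 6
idx_wrong <- cur_pos + 1:length(names_waves)
# With parentheses the range is 3 4, as intended
idx_right <- (cur_pos + 1):length(names_waves)

later_wrong <- names_waves[idx_wrong]   # positions 5 and 6 do not exist and yield NA
later_right <- names_waves[idx_right]

# Alternative: drop the first cur_pos names; gives character(0) for the last wave
later_tail <- tail(names_waves, -cur_pos)

print(later_wrong)
print(later_right)
print(later_tail)
```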

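However, all these for loops don't feel very tidy.

I also ran into trouble with the index operators. At this point I wonder whether using loops and "automating" the process is a good idea at all, or whether it would be easier and more readable to work like this: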

wave1 %>% names -> names_wave1
wave2 %>% names -> names_wave2
wave3 %>% names -> names_wave3
wave4 %>% names -> names_wave4

common_1_2 <- intersect(names_wave1,names_wave2)
common_1_3 <- intersect(names_wave1,names_wave3)
common_1_4 <- intersect(names_wave1,names_wave4)

common_vars_matrix <- tibble(wave2=paste(common_1_2,collapse=","),
                             wave3=paste(common_1_3, collapse=","),
                             wave4=paste(common_1_4, collapse=","))

common_2_3 <- intersect(names_wave2,names_wave3)
common_2_4 <- intersect(names_wave2,names_wave4)

common_vars_matrix <- common_vars_matrix %>% add_row(wave2="-",
                               wave3=paste(common_2_3, collapse=","),
                               wave4=paste(common_2_4, collapse=","))
# And so forth

Given your desired output, this seems to give you what you want.

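# Initialize a data frame with row names of your waves
# (the anchored pattern matches only wave1, wave2, ... and not other
# objects whose names merely contain "wave")
df <- data.frame(row.names = ls(pattern = "^wave\\d+$"))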

# Add in the dashes that you want
for (row in rownames(df)) {
  df[, row] = "-"
}

# Outer loop will loop through rows
for (row in rownames(df)) {
  # Get the data frame corresponding to that row
  row_df <- get(row)
  # Inner loop will loop through the columns
  for (col in colnames(df)) {
    # Get the data frame corresponding to the column
    col_df <- get(col)
    # If they are the same (ex. wave1/ wave1), leave it alone
    if (row != col) {
      # Find which column names are in both
      var_both <- colnames(row_df)[which(colnames(row_df) %in% colnames(col_df))]
      # Concatenate and add the value as a string at the intersection
      df[row, col] <- paste(var_both, collapse = ", ")
    }
  }
}

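However, if you have a lot of data frames, this may not be the most efficient approach performance-wise, since it contains a nested loop (i.e. O(n^2) in the number of data frames).

Anyway, the output is: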

            wave1          wave2                wave3                wave4
wave1           -    id, a, x, y          id, a, x, z             id, a, z
wave2 id, a, x, y              -       id, a, b, c, x             id, a, c
wave3 id, a, x, z id, a, b, c, x                    - id, a, c, i, j, l, z
wave4    id, a, z       id, a, c id, a, c, i, j, l, z                    -
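Because the intersection is symmetric, the matrix above holds every result twice. If only the "current wave versus later waves" half is wanted, as in the desired output of the question, each pair can be visited once with combn. This is a sketch under the assumption that the column names are already collected in a named list (wave_names is an illustrative name); it still compares every pair, so the number of comparisons stays quadratic and is only roughly halved.

```r
# Column names of the simulated waves, in wave order
wave_names <- list(
  wave1 = c("id", "a", "o", "x", "y", "z"),
  wave2 = c("id", "a", "b", "c", "d", "e", "x", "y"),
  wave3 = c("id", "a", "b", "c", "i", "j", "k", "l", "x", "z"),
  wave4 = c("id", "a", "c", "i", "j", "l", "z")
)

n <- length(wave_names)
common <- matrix(
  "-", nrow = n, ncol = n,
  dimnames = list(names(wave_names), names(wave_names))
)

# combn() enumerates every pair (i < j) exactly once
pairs <- combn(n, 2)
for (p in seq_len(ncol(pairs))) {
  i <- pairs[1, p]
  j <- pairs[2, p]
  common[i, j] <- paste(
    intersect(wave_names[[i]], wave_names[[j]]),
    collapse = ", "
  )
}

common_df <- as.data.frame(common, stringsAsFactors = FALSE)
print(common_df)
```

Since the list keeps its own order, this also sidesteps the alphabetical ordering of ls, which would place wave10 before wave2.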