R Panel Data - Get Common Variables between Current and All Consecutive Waves
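We are currently working on a hobby project with a large amount of panel data (many waves, i.e. measurement time points), which can be quite challenging. To get an overview, one idea is to find the variables that the current wave has in common with each subsequent wave.
For example: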
Wave 1: Var1, Var2, Var3, Var4
Wave 2: Var1, Var2, Var4, Var5
Wave 3: Var1, Var5, Var6, Var7
Here, Wave 1 shares Var1, Var2 and Var4 with Wave 2, and only Var1 with Wave 3. Wave 2 and Wave 3 have Var1 and Var5 in common.
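As a minimal sketch of the idea, the toy example can be reproduced with base R's `intersect()`. The list name `wave_vars` and the `common_*` names are only illustrative and are not part of the real data.

```r
# Variable names of the three toy waves from the example
wave_vars <- list(
  Wave1 = c("Var1", "Var2", "Var3", "Var4"),
  Wave2 = c("Var1", "Var2", "Var4", "Var5"),
  Wave3 = c("Var1", "Var5", "Var6", "Var7")
)

# intersect() keeps the elements of its first argument that also occur in the second
common_1_2 <- intersect(wave_vars$Wave1, wave_vars$Wave2)
common_1_3 <- intersect(wave_vars$Wave1, wave_vars$Wave3)
common_2_3 <- intersect(wave_vars$Wave2, wave_vars$Wave3)

print(common_1_2)  # "Var1" "Var2" "Var4"
print(common_1_3)  # "Var1"
print(common_2_3)  # "Var1" "Var5"
```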
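Desired output
A tibble (or data.frame) with the wave of interest in the rows and, in one column per subsequent wave, the variables the two waves have in common.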
Starting | Wave1 | Wave2 | Wave3 |
---|---|---|---|
Wave1 | - | Var1, Var2, Var4 | Var1 |
Wave2 | - | - | Var1, Var5 |
Wave3 | - | - | - |
Simulated data:
pacman::p_load(tidyverse)

wave1 <- tibble(
  id = seq_along(1:100),
  a = runif(100, 0, 100),
  o = runif(100, 0, 100),
  x = runif(100, 0, 100),
  y = runif(100, 0, 100),
  z = runif(100, 0, 100)
)

# In wave2 some observations drop out & some new observations are added
wave2 <- tibble(
  id = seq_along(1:150),
  a = runif(150, 0, 100),
  b = runif(150, 0, 100),
  c = runif(150, 0, 100),
  d = runif(150, 0, 100),
  e = runif(150, 0, 100),
  x = runif(150, 0, 100),
  y = runif(150, 0, 100)
)
# Simulation of Dropout
wave2 %>%
  filter(!id %in% sample(1:150, 23)) -> wave2

# Same with Wave 3
wave3 <- tibble(
  id = c(wave2 %>% pull(id), 151:200),
  a = runif(nrow(wave2) + 50, 0, 100),
  b = runif(nrow(wave2) + 50, 0, 100),
  c = runif(nrow(wave2) + 50, 0, 100),
  i = runif(nrow(wave2) + 50, 0, 100),
  j = runif(nrow(wave2) + 50, 0, 100),
  k = runif(nrow(wave2) + 50, 0, 100),
  l = runif(nrow(wave2) + 50, 0, 100),
  x = runif(nrow(wave2) + 50, 0, 100),
  z = runif(nrow(wave2) + 50, 0, 100)
)
# Simulation of Dropout
wave3 %>%
  filter(!id %in% sample(1:200, 33)) -> wave3

# Same with Wave 4
wave4 <- tibble(
  id = c(wave3 %>% pull(id), 201:300),
  a = runif(nrow(wave3) + 100, 0, 100),
  c = runif(nrow(wave3) + 100, 0, 100),
  i = runif(nrow(wave3) + 100, 0, 100),
  j = runif(nrow(wave3) + 100, 0, 100),
  l = runif(nrow(wave3) + 100, 0, 100),
  z = runif(nrow(wave3) + 100, 0, 100)
)
# Simulation of Dropout in Wave 4
wave4 %>%
  filter(!id %in% sample(1:200, 41)) -> wave4
In the simulated data, variable a, for example, appears in all waves.
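To check which variables appear in every wave, one option is to fold `intersect()` over all waves with base R's `Reduce()`. The sketch below uses only the column names of the simulated waves (the list `wave_names` is an illustrative stand-in; with the actual tibbles it could be built as `list(wave1 = names(wave1), ...)`).

```r
# Column names of the four simulated waves; the values do not matter here
wave_names <- list(
  wave1 = c("id", "a", "o", "x", "y", "z"),
  wave2 = c("id", "a", "b", "c", "d", "e", "x", "y"),
  wave3 = c("id", "a", "b", "c", "i", "j", "k", "l", "x", "z"),
  wave4 = c("id", "a", "c", "i", "j", "l", "z")
)

# Reduce() applies intersect() cumulatively: ((w1 & w2) & w3) & w4
common_all <- Reduce(intersect, wave_names)
print(common_all)  # "id" "a"
```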
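What I have so far
I loop over the wave names (obtained with ls and a regex pattern) in a for-loop, take the current and the next position, fetch the data from the environment with get, and read out the colnames. intersect then gives the column names common to the current wave (in terms of the for-loop) and the next one. Everything is saved to a tibble (initialized as empty).
Finally, group_by and summarise collapse everything into one column.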
##### Get Common Variables between all waves
# Get names of Tibbles (data) from environment
# The backslash must be doubled inside an R string; the anchors keep other
# objects whose names merely contain "wave" out of the result
names_waves <- ls(pattern = "^wave\\d+$") %>% str_sort(numeric = TRUE)
waves_lagged_common2 <- tibble(wave = character(), common_vars = character())
for (wave in names_waves) {
  cur_pos <- names_waves %>% match(x = wave)
  print(cur_pos)
  if (cur_pos != length(names_waves)) {
    cur_wave_names <- get(wave) %>% names()
    next_wave_names <- get(names_waves[cur_pos + 1]) %>% names()
    intersect(cur_wave_names, next_wave_names) -> common_vars
    waves_lagged_common2 <- waves_lagged_common2 %>%
      add_row(wave = wave, common_vars = common_vars)
  }
}
# Merge the rows with group_by and summarise
waves_lagged_common2 %>%
  group_by(wave) %>%
  summarize(common_vars = paste(common_vars, collapse = ", "))
Current output:
# A tibble: 3 x 2
wave common_vars
<chr> <chr>
1 wave1 id, a, x, y
2 wave2 id, a, b, c, x
3 wave3 id, a, c, i, j, l, z
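The same consecutive-wave comparison can be sketched without `get()` or position bookkeeping by keeping the column names in a named list and pairing each wave with its successor via `mapply()`. The list `wave_names` below is an assumed stand-in for the real tibbles' names, and `result` is an illustrative name.

```r
# Column names of the four simulated waves, in wave order
wave_names <- list(
  wave1 = c("id", "a", "o", "x", "y", "z"),
  wave2 = c("id", "a", "b", "c", "d", "e", "x", "y"),
  wave3 = c("id", "a", "b", "c", "i", "j", "k", "l", "x", "z"),
  wave4 = c("id", "a", "c", "i", "j", "l", "z")
)

n <- length(wave_names)

# Pair wave k (all but the last) with wave k + 1 (all but the first)
common_next <- mapply(
  function(cur, nxt) paste(intersect(cur, nxt), collapse = ", "),
  wave_names[-n],
  wave_names[-1]
)

result <- data.frame(
  wave = names(wave_names)[-n],
  common_vars = unname(common_next),
  stringsAsFactors = FALSE
)
print(result)
```

With these column names, `result` has one row per wave except the last, and its `common_vars` column matches the output shown above.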
Further attempts to reach the desired output
Add another for-loop inside the for-loop to go through all subsequent waves of the current wave (cur_pos), and work with add_row.
common_vars_matrix <- tibble(Wave = names_waves) %>%
  column_to_rownames("Wave")
for (wave in names_waves) {
  cur_pos <- names_waves %>% match(x = wave)
  print(cur_pos)
  if (cur_pos != length(names_waves)) {
    cur_wave_names <- get(wave) %>% names()
    next_row_to_add <- c()
    # Parentheses are needed: `:` binds tighter than `+`, so
    # cur_pos+1:length(names_waves) would index past the end of names_waves
    for (next_wave in names_waves[(cur_pos + 1):length(names_waves)]) {
      next_wave_names <- get(next_wave) %>% names()
      intersect(cur_wave_names, next_wave_names) -> common_vars
      print(next_wave)
      # print(common_vars)
      next_row_to_add <- c(next_row_to_add, common_vars)
      print(next_row_to_add)
    }
  }
}
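However, all these for-loops do not feel very neat.
I also ran into trouble with the index operator. At this point I wonder whether using loops and "automating" the process is a good idea at all, or whether it would be easier and more readable to work like this: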
wave1 %>% names -> names_wave1
wave2 %>% names -> names_wave2
wave3 %>% names -> names_wave3
wave4 %>% names -> names_wave4
common_1_2 <- intersect(names_wave1,names_wave2)
common_1_3 <- intersect(names_wave1,names_wave3)
common_1_4 <- intersect(names_wave1,names_wave4)
common_vars_matrix <- tibble(wave2=paste(common_1_2,collapse=","),
wave3=paste(common_1_3, collapse=","),
wave4=paste(common_1_4, collapse=","))
common_2_3 <- intersect(names_wave2,names_wave3)
common_2_4 <- intersect(names_wave2,names_wave4)
common_vars_matrix <- common_vars_matrix %>% add_row(wave2="-",
wave3=paste(common_2_3, collapse=","),
wave4=paste(common_2_4, collapse=","))
# And so forth
Given your desired output, this seems to do what you want.
# Initialize a data frame with row names of your waves
# (anchored pattern so that only wave1, wave2, ... match, not helper objects
# such as names_waves)
df <- data.frame(row.names = ls(pattern = "^wave\\d+$"))
# Add in the dashes that you want
for (row in rownames(df)) {
  df[, row] <- "-"
}
# Outer loop will loop through rows
for (row in rownames(df)) {
  # Get the data frame corresponding to that row
  row_df <- get(row)
  # Inner loop will loop through the columns
  for (col in colnames(df)) {
    # Get the data frame corresponding to the column
    col_df <- get(col)
    # If they are the same (ex. wave1/ wave1), leave it alone
    if (row != col) {
      # Find which column names are in both
      var_both <- colnames(row_df)[which(colnames(row_df) %in% colnames(col_df))]
      # Concatenate and add the value as a string at the intersection
      df[row, col] <- paste(var_both, collapse = ", ")
    }
  }
}
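However, if you have a lot of data frames, this may not be the most efficient approach, since it contains a nested loop over the waves (i.e. O(n^2) comparisons for n waves).
Anyway, the output is: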
wave1 wave2 wave3 wave4
wave1 - id, a, x, y id, a, x, z id, a, z
wave2 id, a, x, y - id, a, b, c, x id, a, c
wave3 id, a, x, z id, a, b, c, x - id, a, c, i, j, l, z
wave4 id, a, z id, a, c id, a, c, i, j, l, z -
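Since the question only asks for each wave compared with the waves after it, a possible variant fills just the upper triangle. The sketch below uses base R's `combn()` to visit every pair of waves once; `wave_names` and `common` are illustrative names, and the column names stand in for the real tibbles. The number of comparisons is still quadratic in the number of waves, which any pairwise table requires, but it is roughly halved.

```r
# Column names of the four simulated waves, in wave order
wave_names <- list(
  wave1 = c("id", "a", "o", "x", "y", "z"),
  wave2 = c("id", "a", "b", "c", "d", "e", "x", "y"),
  wave3 = c("id", "a", "b", "c", "i", "j", "k", "l", "x", "z"),
  wave4 = c("id", "a", "c", "i", "j", "l", "z")
)
waves <- names(wave_names)

# Character matrix pre-filled with dashes; rows = starting wave, cols = later wave
common <- matrix("-", nrow = length(waves), ncol = length(waves),
                 dimnames = list(waves, waves))

# combn() returns each pair (i, j) with i < j exactly once, one pair per column
pairs <- combn(length(waves), 2)
for (k in seq_len(ncol(pairs))) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  common[i, j] <- paste(intersect(wave_names[[i]], wave_names[[j]]),
                        collapse = ", ")
}

print(as.data.frame(common))
```

The diagonal and the lower triangle stay as "-", which matches the layout of the desired output in the question.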