Difference between rows in long format for R based on other column variables
I have an R data frame such as:
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"),4),
Variable = c(rep("X", 4), rep("Y", 4)),
Value = c(3, 5, 6, 6, 3, 8, 3, 6))
ID Condition Variable Value
1 1 A X 3
2 1 B X 5
3 2 A X 6
4 2 B X 6
5 1 A Y 3
6 1 B Y 8
7 2 A Y 3
8 2 B Y 6
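I would like to get the difference between the values of the two Conditions (A - B) for each Variable and ID, while keeping the long format. This means the same difference is repeated on both rows of each pair, like this: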
ID Condition Variable Value diff_value
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
So far I have managed to do something fairly similar with the dplyr package, but it does not work if I want to keep the long format:
library(dplyr)
df %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) - Value)
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 0
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 0
6 1 B Y 8 -5
7 2 A Y 3 0
8 2 B Y 6 -3
You don't need lag here; use diff instead:
df %>%
group_by(Variable,ID) %>%
mutate(diff = -diff(Value))
Output:
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
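Why this works: each (Variable, ID) group has exactly two rows, so diff(Value) returns a single number (B minus A), and mutate() recycles that one value across both rows of the group. The same idea can be sketched in base R with ave(), which needs no extra packages. This is only an illustration, assuming every group holds exactly two rows ordered A then B; the column name diff_value is borrowed from the question.

```r
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

# diff() on a two-element vector returns one number: second minus first.
diff(c(3, 5))   # 2, so -diff(c(3, 5)) is A - B = -2

# ave() applies the function within each ID/Variable group and recycles
# the single result over the rows of that group.
df$diff_value <- ave(df$Value, df$ID, df$Variable,
                     FUN = function(v) -diff(v))
df
```

If a group could contain more or fewer than two rows, diff() would no longer return a single value, so this shortcut is best kept for strictly paired data.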
You don't need to create a lagged variable; just use Value[Condition == "A"] - Value[Condition == "B"], as follows:
df %>%
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"])
# A tibble: 8 x 5
# Groups: ID, Variable [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
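One practical difference from a position-based approach such as -diff(Value) is that indexing by Condition does not depend on row order. The base R sketch below illustrates this on a copy of the data with the rows reversed, so that B precedes A in every group. It assumes each ID/Variable group contains exactly one A row and one B row; the names shuffled, by_position, groups, and result are made up for this example.

```r
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

# Reverse the rows so that B now comes before A within each group.
shuffled <- df[rev(seq_len(nrow(df))), ]
rownames(shuffled) <- NULL

# Position-based difference: computes B - A here, so the sign is flipped.
by_position <- ave(shuffled$Value, shuffled$ID, shuffled$Variable,
                   FUN = function(v) -diff(v))

# Label-based difference: pick the A and B values by name within each group.
keys <- list(shuffled$ID, shuffled$Variable)
groups <- split(shuffled, keys, drop = TRUE)
groups <- lapply(groups, function(g) {
  g$diff_value <- g$Value[g$Condition == "A"] - g$Value[g$Condition == "B"]
  g
})
# unsplit() puts the rows back in their original (shuffled) order.
result <- unsplit(groups, keys, drop = TRUE)
result
```

Here by_position holds the negated differences, while result$diff_value still holds A - B for every row.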
This should work:
# Step one: create a new column of df, where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condition").
temp.fun = function(x, dta)
{
# Given a row x of dta, this function selects the value corresponding to the row
# with same "ID", same "Variable" and different "Condition".
# Notice that if "Condition" is not binary, we need to generalize this function.
# Notice also that this function is super specific to your case, and that it has
# been thought to be used within apply().
# INPUTS:
# - x, a row of a data frame.
# - dta, the data frame (df, in your case).
# OUTPUT:
# - temp.corresponding, "Value" you want for each row.
# Saving information.
temp.id = as.numeric(x["ID"])
temp.condition = as.character(x["Condition"])
temp.variable = as.character(x["Variable"])
# Index for selecting row.
temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
# Selecting "Value".
temp.corresponding = dta$Value[temp.row]
return(temp.corresponding)
}
df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)
# Step two: add/subtract to create the column "diff_value".
# Key: for "A" rows compute Value - corr_value; for "B" rows compute
# corr_value - Value, so both rows of a pair end up holding A - B.
df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]
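Note that this solution fits the specifics of your question exactly and may be neither elegant nor efficient.
I wrote comments in the code to explain how the solution works. In any case, the idea is to first write the function temp.fun(), which operates on a single row: for each row we pass in, it finds df$Value in the row that satisfies the conditions you asked for (same ID, same Variable, different Condition). We then use apply() to pass every row of df to temp.fun(), which creates a new column in df storing the Value mentioned above.
We can now compute df$diff_value. First we initialize the column by filling it with NA. Then we perform the operation. Note: because of the particular nature of the problem, the order of subtraction depends on Condition. When Condition equals A we compute df$Value - df$corr_value, and when Condition equals B we compute df$corr_value - df$Value.
Final warning: if Condition is not binary, this solution must be generalized before it will work.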
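For comparison, the row-by-row lookup that temp.fun() performs can also be sketched in vectorized form with match(). This is only an illustration under the same assumption as above (Condition takes exactly the values A and B, with one row of each per ID/Variable pair); the helper names other, key, and partner_key are invented for this example and are not part of the original answer.

```r
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

# For every row, work out which Condition its partner row has.
other <- ifelse(df$Condition == "A", "B", "A")

# Build a text key for each row and for the partner row it should look up.
key <- paste(df$ID, df$Variable, df$Condition)
partner_key <- paste(df$ID, df$Variable, other)

# match() returns the position of each partner row; use it to fetch its Value.
df$corr_value <- df$Value[match(partner_key, key)]

# A rows: Value - corr_value; B rows: corr_value - Value. Both give A - B.
df$diff_value <- ifelse(df$Condition == "A",
                        df$Value - df$corr_value,
                        df$corr_value - df$Value)
df
```

If a partner row is missing, match() returns NA for that row, so the corresponding diff_value is NA rather than an error.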