Difference between rows in long format for R based on other column variables

I have an R data frame, for example:

df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"),4), 
                        Variable = c(rep("X", 4), rep("Y", 4)),
                        Value = c(3, 5, 6, 6, 3, 8, 3, 6))

  ID Condition Variable Value
1  1         A        X     3
2  1         B        X     5
3  2         A        X     6
4  2         B        X     6
5  1         A        Y     3
6  1         B        Y     8
7  2         A        Y     3
8  2         B        Y     6

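I want to get the difference between the Condition values (A - B) for each combination of Variable and ID, while keeping the long format. This means the same difference has to appear on both rows of each pair, like this: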

  ID Condition Variable Value diff_value
1  1         A        X     3         -2
2  1         B        X     5         -2
3  2         A        X     6          0
4  2         B        X     6          0
5  1         A        Y     3         -5
6  1         B        Y     8         -5
7  2         A        Y     3         -3
8  2         B        Y     6         -3

So far I have managed to do something fairly similar with the dplyr package, but it does not give the result I want when I keep the long format:

library(dplyr)

df %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) -Value)

# A tibble: 8 x 5
# Groups:   Variable, ID [4]
     ID Condition Variable Value diff_value
  <dbl> <chr>     <chr>    <dbl>      <dbl>
1     1 A         X            3          0
2     1 B         X            5         -2
3     2 A         X            6          0
4     2 B         X            6          0
5     1 A         Y            3          0
6     1 B         Y            8         -5
7     2 A         Y            3          0
8     2 B         Y            6         -3

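You don't have to use lag; use diff instead. For a two-row group ordered A then B, diff(Value) returns B - A, so negating it gives A - B, and mutate recycles that single number across both rows: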

df %>% 
  group_by(Variable,ID) %>% 
  mutate(diff = -diff(Value))

Output:

# A tibble: 8 x 5
# Groups:   Variable, ID [4]
     ID Condition Variable Value  diff
  <dbl> <chr>     <chr>    <dbl> <dbl>
1     1 A         X            3    -2
2     1 B         X            5    -2
3     2 A         X            6     0
4     2 B         X            6     0
5     1 A         Y            3    -5
6     1 B         Y            8    -5
7     2 A         Y            3    -3
8     2 B         Y            6    -3
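
The diff approach above relies on every (Variable, ID) group having exactly two rows with A listed before B. As a minimal sketch of the same idea with that ordering made explicit, the following uses only base R (order() and ave(), so no package is needed); it is an illustration under those assumptions, not part of the original answer.

```r
# Same example data as in the question.
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2),
                 Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

# Sort so that A comes before B inside every (Variable, ID) pair.
df <- df[order(df$Variable, df$ID, df$Condition), ]

# ave() applies FUN to each group; the single number returned by
# -diff(v) (that is, A - B) is recycled over both rows of the group.
df$diff_value <- ave(df$Value, df$Variable, df$ID,
                     FUN = function(v) -diff(v))

print(df)
```

If a group could contain more or fewer than two rows, this sketch would need an extra check before calling diff().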

You don't need to create a lagged variable; just use Value[Condition == "A"] - Value[Condition == "B"], as follows:

df %>% 
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"])

# A tibble: 8 x 5
# Groups:   ID, Variable [4]
     ID Condition Variable Value diff_value
  <dbl> <chr>     <chr>    <dbl>      <dbl>
1     1 A         X            3         -2
2     1 B         X            5         -2
3     2 A         X            6          0
4     2 B         X            6          0
5     1 A         Y            3         -5
6     1 B         Y            8         -5
7     2 A         Y            3         -3
8     2 B         Y            6         -3
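
Indexing by Condition assumes each (ID, Variable) group has exactly one A row and one B row; if one of them is missing, the subtraction no longer yields a single number per group. A hedged sketch of one way to tolerate incomplete pairs in base R is shown below. The helper name diff_a_minus_b and the "|" key separator are hypothetical choices made for this illustration.

```r
# Hypothetical helper: A - B per (ID, Variable), NA when a pair is incomplete.
diff_a_minus_b <- function(d) {
  # Build a lookup key from the grouping columns.
  key <- function(x) paste(x$ID, x$Variable, sep = "|")
  a <- d[d$Condition == "A", ]
  b <- d[d$Condition == "B", ]
  # match() returns NA when a group has no A (or no B) row,
  # and indexing with NA propagates NA into the result.
  a_val <- a$Value[match(key(d), key(a))]
  b_val <- b$Value[match(key(d), key(b))]
  d$diff_value <- a_val - b_val
  d
}

df <- data.frame(ID = rep(c(1, 1, 2, 2), 2),
                 Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

full <- diff_a_minus_b(df)
# Drop the B row of (ID = 2, Variable = "Y") to simulate an incomplete pair.
incomplete <- diff_a_minus_b(df[-8, ])
print(incomplete)
```

Note that match() picks the first matching row, so duplicated A or B rows within a group would be silently ignored by this sketch.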

This should work:

# Step one: create a new column of df, where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condition").

temp.fun = function(x, dta)
{
  # Given a row x of dta, this function selects the value corresponding to the row
  # with same "ID", same "Variable" and different "Condition".
  
  # Notice that if "Condition" is not binary, we need to generalize this function.
  
  # Notice also that this function is super specific to your case, and that it has
  # been thought to be used within apply().
  
  # INPUTS:
  #   - x, a row of a data frame.
  #   - dta, the data frame (df, in your case).
  
  # OUTPUT:
  #   - temp.corresponding, "Value" you want for each row.
  
  
  # Saving information.
  temp.id = as.numeric(x["ID"])
  temp.condition = as.character(x["Condition"])
  temp.variable = as.character(x["Variable"])
  
  # Index for selecting row.
  temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
  
  # Selecting "Value".
  temp.corresponding = dta$Value[temp.row]
  
  return(temp.corresponding)
}

df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)

# Step two: fill in the column "diff_value".
# Key: for "A" rows we compute Value - corr_value, for "B" rows we compute
# corr_value - Value, so both rows of a pair hold A - B.

df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]

Note that this solution is tailored to the specific case in your question and may be neither elegant nor efficient.

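I wrote comments in the code to explain how the solution works. The idea is to first write the function temp.fun(), which operates on a single row: for each row we pass in, it finds the df$Value in the row that meets your criteria (same ID, same Variable, different Condition). We then use apply() to pass every row of df through temp.fun(), creating a new column in df that stores that Value.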

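We can now compute df$diff_value. First we initialize the column with NA. Then we fill it in. Note: because of how the problem is set up, the direction of the subtraction depends on Condition. When Condition equals A we compute df$Value - df$corr_value, and when Condition equals B we compute df$corr_value - df$Value, so both rows of a pair end up holding A - B.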

Final warning: if Condition is not binary, this solution has to be generalized to work.
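
What "generalized" should mean depends on the comparison wanted. As one possible reading only, the sketch below treats "A" as a reference condition and stores, on every row, the reference value minus that row's value. The data frame df3, its third condition "C", and the "|" key separator are hypothetical and made up for this illustration.

```r
# Hypothetical data with three conditions per (ID, Variable) pair.
df3 <- data.frame(ID = rep(1:2, each = 3),
                  Condition = rep(c("A", "B", "C"), 2),
                  Variable = "X",
                  Value = c(3, 5, 4, 6, 6, 9))

# Lookup key built from the grouping columns.
key <- function(d) paste(d$ID, d$Variable, sep = "|")

# Rows holding the reference condition (assumed here to be "A").
ref <- df3[df3$Condition == "A", ]

# Value of the reference row for the group each row belongs to.
ref_value <- ref$Value[match(key(df3), key(ref))]

# Reference minus the row's own value; reference rows therefore get 0.
df3$diff_value <- ref_value - df3$Value

print(df3)
```

Unlike the two-condition output earlier in the thread, where A - B is repeated on both rows, the reference rows here hold 0 and every other row holds its own difference from A.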