Difference between rows in long format for R based on other column variables

Question

I have an R data frame such as:

df <- data.frame(ID = rep(c(1, 1, 2, 2), 2),
                 Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

  ID Condition Variable Value
1  1         A        X     3
2  1         B        X     5
3  2         A        X     6
4  2         B        X     6
5  1         A        Y     3
6  1         B        Y     8
7  2         A        Y     3
8  2         B        Y     6

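I want to obtain the difference between the values of the two Conditions (A - B) for each Variable and ID, while keeping the long format. This means the difference must be repeated on both rows of each pair, like this: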

  ID Condition Variable Value diff_value
1  1         A        X     3         -2
2  1         B        X     5         -2
3  2         A        X     6          0
4  2         B        X     6          0
5  1         A        Y     3         -5
6  1         B        Y     8         -5
7  2         A        Y     3         -3
8  2         B        Y     6         -3

So far I have managed to do something fairly similar with the dplyr package, but it does not work if I want to keep the long format:

library(dplyr)

df %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) - Value)

# A tibble: 8 x 5
# Groups:   Variable, ID [4]
     ID Condition Variable Value diff_value
  <dbl> <chr>     <chr>    <dbl>      <dbl>
1     1 A         X            3          0
2     1 B         X            5         -2
3     2 A         X            6          0
4     2 B         X            6          0
5     1 A         Y            3          0
6     1 B         Y            8         -5
7     2 A         Y            3          0
8     2 B         Y            6         -3

Answer 1

You don't have to use lag; use diff instead:

df %>% 
  group_by(Variable,ID) %>% 
  mutate(diff = -diff(Value))

Output:

# A tibble: 8 x 5
# Groups:   Variable, ID [4]
     ID Condition Variable Value  diff
  <dbl> <chr>     <chr>    <dbl> <dbl>
1     1 A         X            3    -2
2     1 B         X            5    -2
3     2 A         X            6     0
4     2 B         X            6     0
5     1 A         Y            3    -5
6     1 B         Y            8    -5
7     2 A         Y            3    -3
8     2 B         Y            6    -3
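
Note that -diff(Value) relies on every group holding exactly two rows, with the A row before the B row. The following is a minimal sketch, assuming the row order is not guaranteed: it sorts by Condition before grouping. The arrange() step and the column name diff_value are additions for illustration and are not part of the answer above.

```r
library(dplyr)

df <- data.frame(ID = rep(c(1, 1, 2, 2), 2),
                 Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

res <- df %>%
  arrange(Variable, ID, Condition) %>%   # assumption: sorting puts A before B
  group_by(Variable, ID) %>%
  mutate(diff_value = -diff(Value)) %>%  # -(B - A) = A - B, recycled to both rows
  ungroup()
```

Groups with more or fewer than two rows are not handled by this sketch.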

Answer 2

You don't need to create a lagged variable; just use Value[Condition == "A"] - Value[Condition == "B"], as follows:

df %>% 
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"])

# A tibble: 8 x 5
# Groups:   ID, Variable [4]
     ID Condition Variable Value diff_value
  <dbl> <chr>     <chr>    <dbl>      <dbl>
1     1 A         X            3         -2
2     1 B         X            5         -2
3     2 A         X            6          0
4     2 B         X            6          0
5     1 A         Y            3         -5
6     1 B         Y            8         -5
7     2 A         Y            3         -3
8     2 B         Y            6         -3
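
Because the two values are picked by Condition rather than by position, this approach should not depend on the order of the rows. A small sketch to illustrate this, assuming the rows arrive in reverse order (the names reversed and res are hypothetical and used only here):

```r
library(dplyr)

df <- data.frame(ID = rep(c(1, 1, 2, 2), 2),
                 Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

# Reverse the rows so that B comes before A within every group.
reversed <- df[nrow(df):1, ]

res <- reversed %>%
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"]) %>%
  ungroup() %>%
  arrange(Variable, ID, Condition)       # restore the original order for display
```

The sketch still assumes that each group contains exactly one A row and one B row.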

Answer 3

This should work:

# Step one: create a new column of df, where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condition").

temp.fun = function(x, dta)
{
  # Given a row x of dta, this function selects the value corresponding to the row
  # with same "ID", same "Variable" and different "Condition".
  
  # Notice that if "Condition" is not binary, we need to generalize this function.
  
  # Notice also that this function is super specific to your case, and that it has
  # been thought to be used within apply().
  
  # INPUTS:
  #   - x, a row of a data frame.
  #   - dta, the data frame (df, in your case).
  
  # OUTPUT:
  #   - temp.corresponding, "Value" you want for each row.
  
  
  # Saving information.
  temp.id = as.numeric(x["ID"])
  temp.condition = as.character(x["Condition"])
  temp.variable = as.character(x["Variable"])
  
  # Index for selecting row.
  temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
  
  # Selecting "Value".
  temp.corresponding = dta$Value[temp.row]
  
  return(temp.corresponding)
}

df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)

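# Step two: create the column "diff_value".
# Key: for rows with "Condition" equal to "A" we compute Value - corr_value;
# for rows with "Condition" equal to "B" we compute corr_value - Value.
# Both cases give the A value minus the B value.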

df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]

Note that this solution is tailored to the specific case in your question, and it may be neither elegant nor efficient.

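I wrote comments in the code to explain how this solution works. In any case, the idea is to first write the function temp.fun(), which operates on a single row: for each row we pass, it finds the df$Value of the row that satisfies your requirements (same ID, same Variable, different Condition). Then we use apply() to pass every row to temp.fun(), which creates a new column in df storing the Value mentioned above.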

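We can now compute df$diff_value. First, we initialize the column by filling it with NA. Then we perform the operation. Note that, because of the specifics of the problem, the order of the subtraction depends on Condition: when Condition equals A we compute df$Value - df$corr_value, whereas when Condition equals B we compute df$corr_value - df$Value. Either way, the result is the A value minus the B value.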

Final caveat: if Condition is not binary, this solution must be generalized to work.
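
One possible generalization is sketched below. It assumes that, with more than two conditions, the goal is the difference between a chosen reference condition and each row's own value; the question does not specify this. The data frame df3, the third condition "C", the variable ref and the column diff_from_ref are all hypothetical and serve only as an illustration.

```r
library(dplyr)

# Hypothetical data with a third condition "C" (not taken from the question).
df3 <- data.frame(ID = rep(1, 3),
                  Condition = c("A", "B", "C"),
                  Variable = rep("X", 3),
                  Value = c(3, 5, 10))

ref <- "A"  # assumed reference condition

res3 <- df3 %>%
  group_by(ID, Variable) %>%
  # Reference value minus each row's own value: A - A, A - B, A - C.
  mutate(diff_from_ref = Value[Condition == ref] - Value) %>%
  ungroup()
```

This sketch requires exactly one row with the reference condition in every group.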
