How to mean center variables based on binary condition in R

Question

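I have a data frame ("md") with several variables, one of which is binary ("adopter"). I would like to mean center three of the other (continuous) variables, say X, Y, and Z, but only for the rows where adopter = 1. Rows where adopter = 0 should remain unchanged. In the end, I would like a new data frame with all variables as before, but with X, Y, and Z mean centered where adopter = 1 and left unchanged where adopter = 0.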

My data frame looks like this (117 observations in total):

adopter	X	Y	Z	A	B
0	0.5	2.3	4.5	3	4.7
1	1.5	6.5	-2.3	69.3	-2.5
...	...	...	...

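So the new data frame should contain the mean-centered values of X, Y, and Z for the second row in this example, because adopter = 1 there, while everything else stays the same.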

I know how to center X, Y, and Z across all rows:

md_cen <- md

covs_to_center <- c("X", "Y", "Z")
md_cen[covs_to_center] <- scale(md_cen[covs_to_center], 
                                scale = FALSE)

But I don't know how to work "only if adopter == 1" into it. I also tried applying a function:

center_apply <- function(x) {
  apply(x, 2, function(y) y - mean(y))}

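However, this again leaves me with mean-centered versions of all X, Y, and Z values, and of course the new data set only contains these three variables.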

Can anyone help me with this?

Answer 1

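The basic way to do what you want is to use a split-apply-combine workflow. That is:

Split your data frame into coherent and useful sub-parts.
Do what you want to each sub-part.
Combine the sub-parts back into a whole.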
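
The three steps above can be sketched on a tiny example with base R's split(), lapply(), and unsplit(). The vectors `values` and `group` are made-up illustration data and are not part of the question's data set.

```r
# Hypothetical toy data, for illustration only
values <- c(1, 2, 3, 10, 20, 30)
group  <- factor(c("a", "a", "a", "b", "b", "b"))

# 1. Split: one numeric vector per group
parts <- split(values, group)

# 2. Apply: center each part on its own mean
parts_centered <- lapply(parts, function(v) v - mean(v))

# 3. Combine: unsplit() puts the pieces back in the original order
centered <- unsplit(parts_centered, group)
centered
## [1]  -1   0   1 -10   0  10
```

Because unsplit() restores the original ordering, no helper column of row numbers is needed in this simplified case.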

First, here is a toy data set:

covs_to_center <- c("X", "Y", "Z")

set.seed(123)

md <- data.frame(
  adopter = sample(0:1, 10, replace = TRUE),
  X = rnorm(10, 2, 1),
  Y = rnorm(10, 3, 2),
  Z = rnorm(10, 5, 10),
  A = rnorm(10, 40, 50),
  B = rnorm(10, 0, 2)
)

md

##    adopter         X          Y          Z          A           B
## 1        0 3.7150650  6.5738263 -11.866933  74.432013 -2.24621717
## 2        0 2.4609162  3.9957010  13.377870  67.695883 -0.80576967
## 3        0 0.7349388 -0.9332343   6.533731  36.904414 -0.93331071
## 4        1 1.3131471  4.4027118  -6.381369  24.701867  1.55993024
## 5        0 1.5543380  2.0544172  17.538149  20.976450 -0.16673813
## 6        1 3.2240818  0.8643526   9.264642   5.264651  0.50663703
## 7        1 2.3598138  2.5640502   2.049285  29.604136 -0.05709351
## 8        1 2.4007715  0.9479911  13.951257 -23.269818 -0.08574091
## 9        0 2.1106827  1.5422175  13.781335 148.447798  2.73720457
## 10       0 1.4441589  1.7499215  13.215811 100.398100 -0.45154197

A base R solution:

md_base <- data.frame(row_num = 1:nrow(md), md)
  # append column of row numbers to make it easier to recombine things later

md_split <- split(md_base, md_base$adopter)
  # this is a list of 2 data frames, corresponding to the 2 possible outcomes
  # of the adopter variable

md_split$`1`[, covs_to_center] <-
  apply(md_split$`1`[, covs_to_center], 2, function(y) y - mean(y))
  # grab the data frame that had a 1 in the response column; apply the centering
  # function to the correct variables in that data frame

md_new <- do.call(rbind, md_split)
  # glue the data frame back together; it will be ordered by adopter

rownames(md_new) <- NULL
  # remove row name artifact created by joining

md_new <- md_new[order(md_new$row_num), names(md_new) != "row_num"]
  # sort by the row_num column, then drop it

The base R approach above is clunky, and I'm sure it can be improved. Here is a tidyverse equivalent that produces the same values:

library(tidyverse)

md %>%
  group_by(adopter) %>%
  mutate(across(all_of(covs_to_center), function(y) y - adopter * mean(y))) %>%
  ungroup()

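The idea behind this is: group by adopter (much like the split() approach), compute the mean() of the relevant variables within each group, and then subtract the subgroup mean multiplied by the adopter variable (meaning that when adopter == 0, nothing is subtracted).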
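
For comparison, the same result can probably be reached in base R without splitting at all, by using a logical row index so that only the adopter rows are overwritten. This is a sketch rather than part of the original answer; the small data frame `md_demo` and the names `is_adopter` and `md_cen` are made up for illustration.

```r
# Hypothetical example data (not the question's real data)
md_demo <- data.frame(
  adopter = c(0, 1, 0, 1, 1),
  X = c(0.5, 1.5, 2.0, 3.5, 4.0),
  Y = c(2.3, 6.5, 1.0, 0.5, 2.0),
  A = c(3, 69.3, 5, 7, 9)
)
covs_to_center <- c("X", "Y")

md_cen <- md_demo
is_adopter <- md_cen$adopter == 1

# Center only the adopter rows; scale() uses the means of those rows alone
md_cen[is_adopter, covs_to_center] <-
  scale(md_cen[is_adopter, covs_to_center], scale = FALSE)

# Adopter rows now have (numerically) zero column means ...
colMeans(md_cen[is_adopter, covs_to_center])

# ... and non-adopter rows are unchanged
all.equal(md_cen[!is_adopter, ], md_demo[!is_adopter, ])
```

Row order and the untouched columns (here `A`) are preserved, so no recombining or re-sorting step is required.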
