如何在纵向分析中为每个人计算 R 中的回归残差?
How to calculate regression residuals in R for each individual in a longitudinal analysis?
我正在研究 longitudinal/repeated 度量多层次模型 (MLM)。通常,对于随时间变化的协变量(在我的例子中 "weekly gross income/1000"),你会计算变量的以人均值为中心的版本(即从整个人每周收入的平均值中减去人年收入反应)说的人的时间点)。然而,这可能会导致偏差 (see here),因此更好(更通用)的方法是围绕每个个体的回归线(碰巧的是,回归的残差用于此目的)。
因此,我需要计算以下回归,但对于每个人(大约 10,000 个人和 25,000 个观察值):
lm(Weekly_Gross_Pay_Main_Job~nYear, data=df)
然后,真正关键的部分是我需要将残差提取到我的主数据集中的单独列中,与每个人匹配。这些残差将取代我的以组均值为中心的变量(它又将在我的 MLM 中使用)。
这里是一个可能的起点,使用我的组均值居中函数。如果这可以更新以适合每个人的残差输出的回归,那将是理想的(如果不是,那么我对其他方法持开放态度):
#Group mean-centering a variable. Relevant for L1 variables only.
gmc = function(variable, group){
return(ave(variable, group, FUN = function(x){x - mean(x)}))
}
df$Weekly_Gross_Pay_Main_Jobgmc <- gmc(df$Weekly_Gross_Pay_Main_Job, df$Person_ID)
长格式数据提取(其中Person_ID
是人,nYear
是时间,Weekly_Gross_Pay_Main_Job
是每周income/1000,Weekly_Gross_Pay_Main_Jobgmc
是组-均值居中版本):
structure(list(Person_ID = c(100003L, 100003L, 100003L, 100006L,
100006L, 100006L, 100006L, 100010L, 100010L, 100010L, 100010L,
100010L, 100010L, 100011L, 100014L, 100014L, 100014L, 100014L,
100014L, 100016L, 100018L, 100018L, 100018L, 100018L, 100018L,
100018L, 100018L, 100018L, 100018L, 100020L, 100020L, 100020L,
100020L, 100020L, 100020L, 100020L, 100020L, 100020L, 100021L,
100021L, 100024L, 100024L, 100024L, 100024L, 100024L, 100024L,
100024L, 100024L, 100024L, 100024L, 100025L, 100025L, 100025L,
100025L, 100025L, 100025L, 100025L, 100025L, 100027L, 100027L,
100027L, 100027L, 100029L, 100029L, 100029L, 100029L, 100029L,
100031L, 100031L, 100031L, 100032L, 100032L, 100032L, 100033L,
100033L, 100033L, 100033L, 100033L, 100033L, 100034L, 100034L,
100034L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L,
100037L, 100044L, 100044L, 100044L, 100044L, 100044L, 100044L,
100044L, 100045L, 100045L, 100045L, 100045L), nYear = c(5L, 6L,
7L, 2L, 3L, 4L, 6L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 5L, 6L, 7L,
8L, 9L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 1L, 2L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 4L, 5L, 6L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 8L,
2L, 3L, 5L, 5L, 6L, 7L, 8L, 9L, 11L, 13L, 2L, 3L, 4L, 6L, 7L,
8L, 9L, 4L, 5L, 6L, 7L), Weekly_Gross_Pay_Main_Job = c(0, 0.58,
0.35, 0.035, 0.65, 0.195, 0.43, 0, 0, 0, 0, 0, 0, 0.12, 1.653,
0.967, 1.742, 1.323, 0, 0.709, 0.155, 0.431, 0.235, 0.17, 0.285,
0.357, 0.28, 0.335, 0.375, 0.111, 0.333, 0.582, 0.882, 0.85,
0.944, 1.615, 1.615, 1.35, 0.168, 0.08, 0, 0, 0, 0, 0, 0, 0,
0.134, 0.737, 0, 0.02, 0.372, 0.1, 0.014, 0.307, 0.39, 0.671,
0.5, 0.278, 0.32, 0.425, 0.4, 0.57, 0.917, 0.75, 0.402, 0.437,
0.211, 0.537, 0.54, 0.135, 0.15, 0.65, 0.324, 0.399, 0.497, 0.67,
0.825, 0.825, 0.25, 0.319, 0.35, 0.885, 0.941, 0.975, 0.975,
1.02, 1.096, 1.148, 0.1, 0.11, 0.413, 0.477, 0.578, 0.686, 0.686,
0.511, 0.578, 0.8, 0.75), Weekly_Gross_Pay_Main_Jobgmc = c(-0.31,
0.27, 0.04, -0.2925, 0.3225, -0.1325, 0.1025, 0, 0, 0, 0, 0,
0, 0, 0.516, -0.17, 0.605, 0.186, -1.137, 0, -0.136444444444444,
0.139555555555556, -0.0564444444444445, -0.121444444444444, -0.00644444444444447,
0.0655555555555555, -0.0114444444444444, 0.0435555555555556,
0.0835555555555555, -0.809222222222222, -0.587222222222222, -0.338222222222222,
-0.0382222222222223, -0.0702222222222223, 0.0237777777777777,
0.694777777777778, 0.694777777777778, 0.429777777777778, 0.044,
-0.044, -0.0871, -0.0871, -0.0871, -0.0871, -0.0871, -0.0871,
-0.0871, 0.0469, 0.6499, -0.0871, -0.27675, 0.07525, -0.19675,
-0.28275, 0.01025, 0.09325, 0.37425, 0.20325, -0.07775, -0.03575,
0.06925, 0.04425, -0.0452, 0.3018, 0.1348, -0.2132, -0.1782,
-0.218333333333333, 0.107666666666667, 0.110666666666667, -0.176666666666667,
-0.161666666666667, 0.338333333333333, -0.266, -0.191, -0.093,
0.0800000000000001, 0.235, 0.235, -0.0563333333333333, 0.0126666666666667,
0.0436666666666666, -0.120714285714286, -0.0647142857142858,
-0.0307142857142858, -0.0307142857142858, 0.0142857142857142,
0.0902857142857143, 0.142285714285714, -0.335714285714286, -0.325714285714286,
-0.0227142857142857, 0.0412857142857143, 0.142285714285714, 0.250285714285714,
0.250285714285714, -0.1368, -0.0698000000000001, 0.1522, 0.1022
)), row.names = c(NA, 100L), class = "data.frame")
不确定我是否没看错,这可能是一个非常幼稚的回答,没有抓住要点,但 "residuals" 并不奏效。
这是一个线性混合效应模型,其中包含我周围的一些数据
some.model<-lme(DV~IV, random=~1|Id, data=df)
head(residuals(some.model))
7 7 24 24 32 32
-0.054135825 -0.054135825 0.064271638 0.064271638 -0.001975424 -0.001975424
如果您真的想将其放入旁边带有 ID 编号的列中,则需要执行更多步骤。应该可以一步完成,但是我真的很烂
extra.column<-residuals(some.model)
extra.column.id<-names(residuals(some.model))
extra.column<-residuals(some.model)
cbind(extra.column,extra.column.id)
extra.column extra.column.id
7 "-0.0541358252373243" "7"
7 "-0.0541358252373243" "7"
24 "0.0642716380035857" "24"
24 "0.0642716380035857" "24"
32 "-0.0019754241828096" "32"
32 "-0.0019754241828096" "32"
抱歉,如果这不是您要查找的内容,请查看残差命令。
这是我最终的做法:
#Before you begin, time needs to be grand-mean centered.
df$nYearmc <- df$nYear-mean(df$nYear, na.rm=TRUE)
#Now to regress the time-varying covariate onto grand-mean centered time and complete the process.
#First, create a group called `by_person`.
df <- tidyr::unite(df, Person_Year, c(Person_ID, nYearmc), remove=FALSE)
by_Person <- dplyr::group_by(df, Person_ID)
#Second, regress the time-varying covariate onto the newly created grand-mean centered time variable and merge with the main data frame.
df.Weekly_Gross_Pay_Main_Job <- dplyr::do(by_Person, augment(lm(Weekly_Gross_Pay_Main_Job~nYearmc, data=.)))
df.Weekly_Gross_Pay_Main_Job <- tidyr::unite(df.Weekly_Gross_Pay_Main_Job, Person_Year, c(Person_ID, nYearmc), remove=FALSE)
df <- merge(df, df.Weekly_Gross_Pay_Main_Job, by="Person_Year")
#Third, copy over the required columns (renaming them would be more efficient, but either way).
df$RegResGrossPay <- df$.resid
#Fourth, do an optional tidy up.
colnames(df)[colnames(df)=="Person_ID.x"] <- "Person_ID"
colnames(df)[colnames(df)=="nYearmc.x"] <- "nYearmc"
colnames(df)[colnames(df)=="Weekly_Gross_Pay_Main_Job.x"] <- "Weekly_Gross_Pay_Main_Job"
df$Person_ID.y <- NULL
df$nYearmc.y <- NULL
df$Weekly_Gross_Pay_Main_Job.y <- NULL
df$.fitted <- NULL
df$.se.fit <- NULL
df$.resid <- NULL
df$.hat <- NULL
df$.sigma <- NULL
df$.cooksd <- NULL
df$.std.resid <- NULL
df.Weekly_Gross_Pay_Main_Job <- NULL
#Fifth, generate plots of the variables you need.
ggplot(df, aes(nYearmc, RegResGrossPay))+geom_line(aes(group=Person_ID), alpha =1/3)+geom_smooth(method="lm",se=FALSE)
我正在研究 longitudinal/repeated 度量多层次模型 (MLM)。通常,对于随时间变化的协变量(在我的例子中 "weekly gross income/1000"),你会计算变量的以人均值为中心的版本(即从整个人每周收入的平均值中减去人年收入反应)说的人的时间点)。然而,这可能会导致偏差 (see here),因此更好(更通用)的方法是围绕每个个体的回归线(碰巧的是,回归的残差用于此目的)。
因此,我需要计算以下回归,但对于每个人(大约 10,000 个人和 25,000 个观察值):
lm(Weekly_Gross_Pay_Main_Job~nYear, data=df)
然后,真正关键的部分是我需要将残差提取到我的主数据集中的单独列中,与每个人匹配。这些残差将取代我的以组均值为中心的变量(它又将在我的 MLM 中使用)。
这里是一个可能的起点,使用我的组均值居中函数。如果这可以更新以适合每个人的残差输出的回归,那将是理想的(如果不是,那么我对其他方法持开放态度):
#Group mean-centering a variable. Relevant for L1 variables only.
gmc = function(variable, group){
return(ave(variable, group, FUN = function(x){x - mean(x)}))
}
df$Weekly_Gross_Pay_Main_Jobgmc <- gmc(df$Weekly_Gross_Pay_Main_Job, df$Person_ID)
长格式数据提取(其中Person_ID
是人,nYear
是时间,Weekly_Gross_Pay_Main_Job
是每周income/1000,Weekly_Gross_Pay_Main_Jobgmc
是组-均值居中版本):
structure(list(Person_ID = c(100003L, 100003L, 100003L, 100006L,
100006L, 100006L, 100006L, 100010L, 100010L, 100010L, 100010L,
100010L, 100010L, 100011L, 100014L, 100014L, 100014L, 100014L,
100014L, 100016L, 100018L, 100018L, 100018L, 100018L, 100018L,
100018L, 100018L, 100018L, 100018L, 100020L, 100020L, 100020L,
100020L, 100020L, 100020L, 100020L, 100020L, 100020L, 100021L,
100021L, 100024L, 100024L, 100024L, 100024L, 100024L, 100024L,
100024L, 100024L, 100024L, 100024L, 100025L, 100025L, 100025L,
100025L, 100025L, 100025L, 100025L, 100025L, 100027L, 100027L,
100027L, 100027L, 100029L, 100029L, 100029L, 100029L, 100029L,
100031L, 100031L, 100031L, 100032L, 100032L, 100032L, 100033L,
100033L, 100033L, 100033L, 100033L, 100033L, 100034L, 100034L,
100034L, 100037L, 100037L, 100037L, 100037L, 100037L, 100037L,
100037L, 100044L, 100044L, 100044L, 100044L, 100044L, 100044L,
100044L, 100045L, 100045L, 100045L, 100045L), nYear = c(5L, 6L,
7L, 2L, 3L, 4L, 6L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 5L, 6L, 7L,
8L, 9L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 1L, 2L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 14L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 4L, 5L, 6L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 8L,
2L, 3L, 5L, 5L, 6L, 7L, 8L, 9L, 11L, 13L, 2L, 3L, 4L, 6L, 7L,
8L, 9L, 4L, 5L, 6L, 7L), Weekly_Gross_Pay_Main_Job = c(0, 0.58,
0.35, 0.035, 0.65, 0.195, 0.43, 0, 0, 0, 0, 0, 0, 0.12, 1.653,
0.967, 1.742, 1.323, 0, 0.709, 0.155, 0.431, 0.235, 0.17, 0.285,
0.357, 0.28, 0.335, 0.375, 0.111, 0.333, 0.582, 0.882, 0.85,
0.944, 1.615, 1.615, 1.35, 0.168, 0.08, 0, 0, 0, 0, 0, 0, 0,
0.134, 0.737, 0, 0.02, 0.372, 0.1, 0.014, 0.307, 0.39, 0.671,
0.5, 0.278, 0.32, 0.425, 0.4, 0.57, 0.917, 0.75, 0.402, 0.437,
0.211, 0.537, 0.54, 0.135, 0.15, 0.65, 0.324, 0.399, 0.497, 0.67,
0.825, 0.825, 0.25, 0.319, 0.35, 0.885, 0.941, 0.975, 0.975,
1.02, 1.096, 1.148, 0.1, 0.11, 0.413, 0.477, 0.578, 0.686, 0.686,
0.511, 0.578, 0.8, 0.75), Weekly_Gross_Pay_Main_Jobgmc = c(-0.31,
0.27, 0.04, -0.2925, 0.3225, -0.1325, 0.1025, 0, 0, 0, 0, 0,
0, 0, 0.516, -0.17, 0.605, 0.186, -1.137, 0, -0.136444444444444,
0.139555555555556, -0.0564444444444445, -0.121444444444444, -0.00644444444444447,
0.0655555555555555, -0.0114444444444444, 0.0435555555555556,
0.0835555555555555, -0.809222222222222, -0.587222222222222, -0.338222222222222,
-0.0382222222222223, -0.0702222222222223, 0.0237777777777777,
0.694777777777778, 0.694777777777778, 0.429777777777778, 0.044,
-0.044, -0.0871, -0.0871, -0.0871, -0.0871, -0.0871, -0.0871,
-0.0871, 0.0469, 0.6499, -0.0871, -0.27675, 0.07525, -0.19675,
-0.28275, 0.01025, 0.09325, 0.37425, 0.20325, -0.07775, -0.03575,
0.06925, 0.04425, -0.0452, 0.3018, 0.1348, -0.2132, -0.1782,
-0.218333333333333, 0.107666666666667, 0.110666666666667, -0.176666666666667,
-0.161666666666667, 0.338333333333333, -0.266, -0.191, -0.093,
0.0800000000000001, 0.235, 0.235, -0.0563333333333333, 0.0126666666666667,
0.0436666666666666, -0.120714285714286, -0.0647142857142858,
-0.0307142857142858, -0.0307142857142858, 0.0142857142857142,
0.0902857142857143, 0.142285714285714, -0.335714285714286, -0.325714285714286,
-0.0227142857142857, 0.0412857142857143, 0.142285714285714, 0.250285714285714,
0.250285714285714, -0.1368, -0.0698000000000001, 0.1522, 0.1022
)), row.names = c(NA, 100L), class = "data.frame")
不确定我是否没看错,这可能是一个非常幼稚的回答,没有抓住要点,但 "residuals" 并不奏效。 这是一个线性混合效应模型,其中包含我周围的一些数据
some.model<-lme(DV~IV, random=~1|Id, data=df)
head(residuals(some.model))
7 7 24 24 32 32
-0.054135825 -0.054135825 0.064271638 0.064271638 -0.001975424 -0.001975424
如果您真的想将其放入旁边带有 ID 编号的列中,则需要执行更多步骤。应该可以一步完成,但是我真的很烂
extra.column<-residuals(some.model)
extra.column.id<-names(residuals(some.model))
extra.column<-residuals(some.model)
cbind(extra.column,extra.column.id)
extra.column extra.column.id
7 "-0.0541358252373243" "7"
7 "-0.0541358252373243" "7"
24 "0.0642716380035857" "24"
24 "0.0642716380035857" "24"
32 "-0.0019754241828096" "32"
32 "-0.0019754241828096" "32"
抱歉,如果这不是您要查找的内容,请查看残差命令。
这是我最终的做法:
#Before you begin, time needs to be grand-mean centered.
df$nYearmc <- df$nYear-mean(df$nYear, na.rm=TRUE)
#Now to regress the time-varying covariate onto grand-mean centered time and complete the process.
#First, create a group called `by_person`.
df <- tidyr::unite(df, Person_Year, c(Person_ID, nYearmc), remove=FALSE)
by_Person <- dplyr::group_by(df, Person_ID)
#Second, regress the time-varying covariate onto the newly created grand-mean centered time variable and merge with the main data frame.
df.Weekly_Gross_Pay_Main_Job <- dplyr::do(by_Person, augment(lm(Weekly_Gross_Pay_Main_Job~nYearmc, data=.)))
df.Weekly_Gross_Pay_Main_Job <- tidyr::unite(df.Weekly_Gross_Pay_Main_Job, Person_Year, c(Person_ID, nYearmc), remove=FALSE)
df <- merge(df, df.Weekly_Gross_Pay_Main_Job, by="Person_Year")
#Third, copy over the required columns (renaming them would be more efficient, but either way).
df$RegResGrossPay <- df$.resid
#Fourth, do an optional tidy up.
colnames(df)[colnames(df)=="Person_ID.x"] <- "Person_ID"
colnames(df)[colnames(df)=="nYearmc.x"] <- "nYearmc"
colnames(df)[colnames(df)=="Weekly_Gross_Pay_Main_Job.x"] <- "Weekly_Gross_Pay_Main_Job"
df$Person_ID.y <- NULL
df$nYearmc.y <- NULL
df$Weekly_Gross_Pay_Main_Job.y <- NULL
df$.fitted <- NULL
df$.se.fit <- NULL
df$.resid <- NULL
df$.hat <- NULL
df$.sigma <- NULL
df$.cooksd <- NULL
df$.std.resid <- NULL
df.Weekly_Gross_Pay_Main_Job <- NULL
#Fifth, generate plots of the variables you need.
ggplot(df, aes(nYearmc, RegResGrossPay))+geom_line(aes(group=Person_ID), alpha =1/3)+geom_smooth(method="lm",se=FALSE)