R: Several linear regressions in one data frame with factors and NAs
I'm new to R and have to work with a dataset of more than 100 columns. A simplified version looks like this:
Station time data1 data2 data3 data4.....
1 0.0 35.02430310 44.2229390 NA
1 0.8 -68.75294241 -85.5847503 NA
1 1.8 -43.10200333 -62.8035400 NA
3 0.0 0.02217693 0.1336396 0.03203031
3 0.9 7.84203118 -6.4854953 6.22910506
3 2.2 -0.41682970 -7.7022785 0.92807170
17 0.0 4.24864888 4.2104517 0.00000000
17 0.9 1.79933934 -6.6360999 -10.10756894
17 2.1 1.99226283 2.2676248 -13.15887674
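For each data column I want to run a linear regression against time, but I need the coefficients per station (Station is a factor). With the plyr package I used: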
ddply(dataframe, .(Station), function(z) coef(lm(data1 ~ time, data=z)))
For data1, for example, this gives:
Station (Intercept) t.h.
1 1 9.674588 -40.5399850
2 37 3.130705 -0.6284611
3 48 3.657316 -0.9474062
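This is the form in which I need the coefficients, but for every data column. Even if I ran this code once per data column, I would still have a problem with the columns that contain NA values. I would like to simply drop those stations, but only for the affected column (here, only for data3). For data1 and data2 I want to keep Station 1.
Is there a solution? Any suggestions would be appreciated.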
Data from dput:
structure(list(Station = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("1", "3", "17"), class = "factor"), time = c(0,
0.8, 1.8, 0, 0.9, 2.2, 0, 0.9, 2.2), data1 = c(35.0243031, -68.75294241,
-43.10200333, 0.02217693, 7.84203118, -0.4168297, 4.24864888,
1.79933934, 1.99226283), data2 = c(44.222939, -85.5847503, -62.80354,
0.1336396, -6.4854953, -7.7022785, 4.2104517, -6.6360999, 2.2676248
), data3 = c(NA, NA, NA, 0.1410939, 30.0332505, 11.449285, 0.1161954,
-2.061781, 0.2289149)), .Names = c("Station", "time", "data1",
"data2", "data3"), row.names = c(NA, -9L), class = "data.frame")
Using complete.cases would drop Station 1 entirely. Is that what you want?
library(plyr)  # provides ddply()
DF <- read.table(text="Station time data1 data2 data3
1 0.0 35.02430310 44.2229390 NA
1 0.8 -68.75294241 -85.5847503 NA
1 1.8 -43.10200333 -62.8035400 NA
3 0.0 0.02217693 0.1336396 0.03203031
3 0.9 7.84203118 -6.4854953 6.22910506
3 2.2 -0.41682970 -7.7022785 0.92807170
17 0.0 4.24864888 4.2104517 0.00000000
17 0.9 1.79933934 -6.6360999 -10.10756894
17 2.1 1.99226283 2.2676248 -13.15887674",header=TRUE,stringsAsFactors=FALSE)
ddply(DF, .(Station), function(z) z[complete.cases(z),])
# Station time data1 data2 data3
#1 3 0.0 0.02217693 0.1336396 0.03203031
#2 3 0.9 7.84203118 -6.4854953 6.22910506
#3 3 2.2 -0.41682970 -7.7022785 0.92807170
#4 17 0.0 4.24864888 4.2104517 0.00000000
#5 17 0.9 1.79933934 -6.6360999 -10.10756894
#6 17 2.1 1.99226283 2.2676248 -13.15887674
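If the goal is instead to drop a station only for the column where it has no data, one option is to check for NAs in just the two columns that enter each model. The following is a minimal base R sketch, not the approach from the answer above; `dat` is rebuilt from the question's dput output (data1 and data3 only) and `coef_by_station` is a hypothetical helper name.

```r
# Sample data taken from the question's dput() output (data1 and data3 only)
dat <- data.frame(
  Station = factor(rep(c("1", "3", "17"), each = 3), levels = c("1", "3", "17")),
  time    = c(0, 0.8, 1.8, 0, 0.9, 2.2, 0, 0.9, 2.2),
  data1   = c(35.0243031, -68.75294241, -43.10200333, 0.02217693, 7.84203118,
              -0.4168297, 4.24864888, 1.79933934, 1.99226283),
  data3   = c(NA, NA, NA, 0.1410939, 30.0332505, 11.449285, 0.1161954,
              -2.061781, 0.2289149)
)

# Hypothetical helper: per-station coefficients for one response column.
# Only rows where `time` or that column is NA are removed, so a station
# is lost only for the column in which it has no usable data.
coef_by_station <- function(d, column) {
  keep <- complete.cases(d[, c("time", column)])
  d <- droplevels(d[keep, ])                 # forget stations with no rows left
  fits <- lapply(split(d, d$Station), function(z) {
    coef(lm(reformulate("time", response = column), data = z))
  })
  do.call(rbind, fits)                       # one row per station
}

res1 <- coef_by_station(dat, "data1")  # rows: 1, 3, 17
res3 <- coef_by_station(dat, "data3")  # rows: 3, 17 (Station 1 dropped here only)
```

Calling the helper once per data column, for example with lapply over the column names, would cover a wide data frame without removing Station 1 from data1 or data2.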
First reshape your data.frame to long format, then omit the NA values, then fit the model for each unique key (data and Station), and finally tidy the output of the lm() calls.
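library(dplyr)  # provides %>%, group_by() and do()
library(tidyr)
library(broom)
# df is the data.frame from the question's dput() output
df %>% gather(data, value, -c(Station, time)) %>%  # wide to long
  na.omit() %>%                                    # removes Station 1 for data3 only
  group_by(data, Station) %>%
  # tidy() on a named numeric vector returns columns `names` and `x` here;
  # newer broom versions may no longer support vector input
  do(tidy(coef(lm(value ~ time, data = .)))) %>%
  spread(names, x)                                 # one column per coefficient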
# data Station `(Intercept)` time
#* <chr> <fctr> <dbl> <dbl>
#1 data1 1 9.5534021 -40.5734035
#2 data1 3 3.1391280 -0.6354857
#3 data1 17 3.6539549 -0.9424560
#4 data2 1 13.8883780 -56.0886482
#5 data2 3 -1.1964287 -3.3757574
#6 data2 17 0.2938263 -0.3353234
#7 data3 3 9.9859146 3.7631889
#8 data3 17 -0.7504115 0.1724399
The example data used is what you shared, up to column data3.
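For readers without the tidyverse packages installed, the same reshape, omit, group and fit steps can be sketched in base R. This is an illustrative alternative under the assumption that every response column name starts with "data"; the names `long`, `groups` and `result`, and the output columns `intercept` and `slope`, are chosen here and are not part of the answer above.

```r
# Data from the question's dput() output
dat <- data.frame(
  Station = factor(rep(c("1", "3", "17"), each = 3), levels = c("1", "3", "17")),
  time    = c(0, 0.8, 1.8, 0, 0.9, 2.2, 0, 0.9, 2.2),
  data1   = c(35.0243031, -68.75294241, -43.10200333, 0.02217693, 7.84203118,
              -0.4168297, 4.24864888, 1.79933934, 1.99226283),
  data2   = c(44.222939, -85.5847503, -62.80354, 0.1336396, -6.4854953,
              -7.7022785, 4.2104517, -6.6360999, 2.2676248),
  data3   = c(NA, NA, NA, 0.1410939, 30.0332505, 11.449285, 0.1161954,
              -2.061781, 0.2289149)
)

# Assumption: all response columns are named data1, data2, ...
value_cols <- grep("^data", names(dat), value = TRUE)

# Long format: one row per (Station, time, data column)
long <- data.frame(
  Station = rep(dat$Station, times = length(value_cols)),
  time    = rep(dat$time, times = length(value_cols)),
  data    = rep(value_cols, each = nrow(dat)),
  value   = unlist(dat[value_cols], use.names = FALSE)
)
long <- na.omit(long)  # removes Station 1 for data3 only

# One model per (data, Station) pair; drop = TRUE skips empty combinations
groups <- split(long, list(long$data, long$Station), drop = TRUE)
result <- do.call(rbind, lapply(groups, function(g) {
  cf <- coef(lm(value ~ time, data = g))
  data.frame(data = g$data[1], Station = g$Station[1],
             intercept = cf[["(Intercept)"]], slope = cf[["time"]])
}))
result <- result[order(result$data, result$Station), ]
rownames(result) <- NULL
```

The result has one row per fitted model, so the data3 block contains only Stations 3 and 17 while data1 and data2 keep all three stations.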