Using a tibble to apply linear models by group to new data
Suppose I have two datasets for the same set of irises, collected in two different years:
# Load the tidyverse packages used below (dplyr, tidyr, purrr).
library(tidyverse)
# Create data for reproducible results.
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4] # let's make the 2008 data different
I want to fit a separate linear model for each species in the 2007 data, which I can do like this:
# First nest by Species.
iris.2007.nested <- iris.2007 %>%
  group_by(Species) %>%
  nest()
# Now apply the linear model call by group using the data.
iris.2007.nested <- iris.2007.nested %>%
  mutate(models = map(data,
                      ~ lm(Petal.Length ~ Petal.Width, data = .)))
When we look at the result, it makes sense as a nicely organized tibble.
head(iris.2007.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
Now let's do the same thing for the 2008 data.
# First nest by species.
iris.2008.nested <- iris.2008 %>%
  group_by(Species) %>%
  nest()
# Now apply the linear model call by species using the data.
iris.2008.nested <- iris.2008.nested %>%
  mutate(models = map(data,
                      ~ lm(Petal.Length ~ Petal.Width, data = .)))
Once again, we get a nice tibble.
head(iris.2008.nested)
# A tibble: 3 × 3
Species data models
<fctr> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm>
2 versicolor <tibble [50 × 4]> <S3: lm>
3 virginica <tibble [50 × 4]> <S3: lm>
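To see what the models list column holds, the fitted coefficients can be pulled out into ordinary numeric columns. The following is only a sketch: coef_by_species() is a hypothetical helper that is not part of the original post, and it uses the nest(data = -Species) spelling, which assumes tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]

# Hypothetical helper (not in the original post): fit one lm per species and
# return the intercept and slope of Petal.Length ~ Petal.Width as plain columns.
coef_by_species <- function(df) {
  df %>%
    nest(data = -Species) %>%
    mutate(
      models    = map(data, ~ lm(Petal.Length ~ Petal.Width, data = .x)),
      intercept = map_dbl(models, ~ coef(.x)[["(Intercept)"]]),
      slope     = map_dbl(models, ~ coef(.x)[["Petal.Width"]])
    ) %>%
    select(Species, intercept, slope)
}

coef.2007 <- coef_by_species(iris.2007)
coef.2008 <- coef_by_species(iris.2008)
```

Because the 2008 measurements are exactly twice the 2007 ones, the slopes in coef.2008 should match those in coef.2007 and the intercepts should be doubled, which gives a quick sanity check on the per-species fits.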
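What I want to do now is use the linear models fitted to the 2008 data to predict results from the 2007 data. I thought the best way to do this would be to merge the two datasets (preserving the group structure). This is what happens when I try to merge the two nested tibbles: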
iris.both.nested <- merge(iris.2007.nested, iris.2008.nested, by='Species')
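As you can see below, the merged result no longer has the same format as the individual tibbles above. Specifically, its organization is hard to make out (note that I have not included the full output in this block, but you get the idea).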
head(iris.both.nested)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614, 0.06313855,
...
Apparently I can still apply the models fitted to the 2008 data (now models.y) to the 2007 data (now data.x):
iris.both.nested.pred <- iris.both.nested %>%
  mutate(pred = map2(models.y, data.x, predict))
The result is again not a nicely organized tibble (again, the full output is not shown):
head(iris.both.nested.pred)
Species
1 setosa
2 versicolor
3 virginica
data.x
1 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, ...
... <truncated>
1 1.327563, 0.5464903, -0.03686145, -0.03686145, -0.1368614,
...
So my question is: does this process work, even though the tibbles become oddly organized after merging? Or am I missing something? Thanks!
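A likely reason for the changed printing is that merge() is a base R function and returns a plain data.frame, which prints list columns in full. As a minimal sketch, assuming dplyr's inner_join() is an acceptable substitute, the same join can be done while keeping a tibble. fit_by_species() is a hypothetical helper and the .2007/.2008 suffixes are chosen only for illustration; neither appears in the original post.

```r
library(dplyr)
library(tidyr)
library(purrr)

iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]

# Hypothetical helper: nest by Species and fit one lm per group.
fit_by_species <- function(df) {
  df %>%
    nest(data = -Species) %>%
    mutate(models = map(data, ~ lm(Petal.Length ~ Petal.Width, data = .x)))
}

# inner_join() returns a tibble, so the list columns keep their compact printing.
both <- inner_join(
  fit_by_species(iris.2007),
  fit_by_species(iris.2008),
  by = "Species",
  suffix = c(".2007", ".2008")
)

# Apply each 2008 model to the matching 2007 data, naming newdata explicitly.
preds <- both %>%
  mutate(pred = map2(models.2008, data.2007, ~ predict(.x, newdata = .y)))
```

Passing newdata by name makes it explicit that the second argument is the data to predict on, which map2(models.y, data.x, predict) relies on positionally.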
I would nest it first and then apply the models:
# Data
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]
joined <- bind_rows(
  cbind(dset = rep("iris.2007", length(iris.2007$Species)), iris.2007),
  cbind(dset = rep("iris.2008", length(iris.2008$Species)), iris.2008)
)
# Double nesting: one row per dataset, each holding a tibble nested by Species.
# Written with the tidyr >= 1.0 interface, nest(new_col = columns);
# older versions used group_by(dset) %>% nest(.key = data1).
joined_nested <- joined %>%
  nest(data1 = -dset) %>%
  mutate(data1 = map(data1, ~ nest(.x, data = -Species)))
# Now apply the linear model call by group using the data.
joined_nested_models <- joined_nested %>%
  mutate(data1 = map(data1, ~ .x %>%
    mutate(models = map(data,
                        ~ lm(Petal.Length ~ Petal.Width, data = .)))))
joined_nested_models %>% unnest(cols = data1)
# # A tibble: 6 × 4
# dset Species data models
# <chr> <fctr> <list> <list>
# 1 iris.2007 setosa <tibble [50 × 4]> <S3: lm>
# 2 iris.2007 versicolor <tibble [50 × 4]> <S3: lm>
# 3 iris.2007 virginica <tibble [50 × 4]> <S3: lm>
# 4 iris.2008 setosa <tibble [50 × 4]> <S3: lm>
# 5 iris.2008 versicolor <tibble [50 × 4]> <S3: lm>
# 6 iris.2008 virginica <tibble [50 × 4]> <S3: lm>
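The table above holds the models for both years but stops short of predicting. One possible continuation, sketched here under the assumption that tidyr 1.0 or later is available for pivot_wider(), is to spread the two datasets into side-by-side columns so that each species row carries both years. The objects models_long, models_wide and preds are illustrative names, and models_long is built directly in the long shape of the printed table rather than by double nesting.

```r
library(dplyr)
library(tidyr)
library(purrr)

iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]

# Stack both years; .id records which dataset each row came from.
joined <- bind_rows(iris.2007 = iris.2007, iris.2008 = iris.2008, .id = "dset")

# One row per dataset and species, with its data and fitted model.
models_long <- joined %>%
  nest(data = -c(dset, Species)) %>%
  mutate(models = map(data, ~ lm(Petal.Length ~ Petal.Width, data = .x)))

# One row per species; columns are named like data_iris.2007, models_iris.2008.
models_wide <- models_long %>%
  pivot_wider(names_from = dset, values_from = c(data, models))

# Apply the 2008 models to the 2007 data and flatten to one row per flower.
preds <- models_wide %>%
  mutate(pred = map2(models_iris.2008, data_iris.2007,
                     ~ predict(.x, newdata = .y))) %>%
  select(Species, pred) %>%
  unnest(cols = pred)
```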
Here is a tidier version of what you would get using inner_join:
# create data from example
iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2*iris.2008[1:4]
# combine data
irisAllData <- iris.2007 %>%
  mutate(year = 2007) %>%
  bind_rows(mutate(iris.2008, year = 2008)) %>%
  group_by(Species) %>%
  nest()
# model and predict: fit on the 2007 rows, predict for the 2008 rows
irisPredict <- irisAllData %>%
  mutate(modelData = map(data, ~ filter(.x, year == 2007)),
         validationData = map(data, ~ filter(.x, year == 2008)),
         model = map(modelData, ~ lm(Petal.Length ~ Petal.Width, data = .x)),
         prediction = map2(model, validationData,
                           ~ predict(object = .x, newdata = .y))) %>%
  select(Species, prediction) %>%
  unnest(cols = prediction)
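The code above fits on the 2007 rows and predicts for the 2008 rows, whereas the question asks for the opposite direction (2008 models applied to 2007 data). A small sketch of how the direction could be made a parameter follows; predict_across_years() and its arguments fit_year and predict_year are hypothetical names introduced here, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

iris.2007 <- iris
iris.2008 <- iris
iris.2008[1:4] <- 2 * iris.2008[1:4]

all_data <- bind_rows(
  mutate(iris.2007, year = 2007),
  mutate(iris.2008, year = 2008)
)

# Hypothetical helper: per species, fit on the rows from `fit_year` and
# predict for the rows from `predict_year`.
predict_across_years <- function(all_data, fit_year, predict_year) {
  all_data %>%
    nest(data = -Species) %>%
    mutate(
      model = map(data, ~ lm(Petal.Length ~ Petal.Width,
                             data = filter(.x, year == fit_year))),
      prediction = map2(model, data,
                        ~ predict(.x, newdata = filter(.y, year == predict_year)))
    ) %>%
    select(Species, prediction) %>%
    unnest(cols = prediction)
}

# The direction asked for in the question: 2008 models, 2007 data.
pred_2007 <- predict_across_years(all_data, fit_year = 2008, predict_year = 2007)
```

Swapping the two year arguments reproduces the direction used in the answer, so both cases share one code path.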