Unexpected ddply() output: not grouping as desired
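I am new to R and am using it to analyze some demographic data for a plant species. My data frame includes:
TagKey (unique identifier), Year (year of observation), TagEstablished (year the plant was first found), and StageClass (0 = dead, 1 = seedling, 2 = vegetative, 3 = reproductive). There is one row for every year a plant was visited, but I would like one row per plant, followed by a status column for each year. The goal is to track the status of each individual from year to year.
Example data: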
TagKey <- c("PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1")
Year <- c(2020, 2020, 2020, 2021, 2021, 2021)
TagEstablished <- c(2020, 2020, 2020, 2020, 2020, 2020)
StageClass <- c(1, 2, 3, 0, 3, 3)
ALFO_stages <- data.frame(TagKey, Year, TagEstablished, StageClass)
I tried using ddply:
ALFO_status <- ddply(ALFO_stages, .(TagKey), dplyr::summarize,
                     Year_Established = TagEstablished,
                     Status2020 = if(Year=="2020") {StageClass},
                     Status2021 = if(Year=="2021") {StageClass})
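My output is not grouped by TagKey as I want. The output for each respective year is correct, but the years that do not apply just spit out NA. Help?
Based on this sentence: "There is one row for every year a plant was visited, but I would like 1 row per plant, followed by columns for each year's status," it sounds like what you want is to reshape, or pivot, the data.
'Group by' is usually part of summarising data. For example, counting the number of records per year involves grouping by year. Pivoting, or reshaping, is the process of turning column contents into column labels, or vice versa.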
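To make the contrast concrete, here is a minimal sketch of the summarising case. The `visits` data frame and its values are made up purely for illustration and are not the asker's data.

```r
library(dplyr)

# Hypothetical toy data: one row per plant per year visited
visits <- data.frame(
  TagKey = c("A", "A", "B", "B", "B"),
  Year   = c(2019, 2020, 2019, 2020, 2021)
)

# Grouping by Year collapses the rows to one per year,
# here counting how many records fall in each year
records_per_year <- visits %>%
  group_by(Year) %>%
  summarise(n_records = n())

records_per_year
```

The result has one row per year, which is a summary. It is not the one-row-per-plant layout asked for in the question.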
In R, I would recommend the tidyr package. Perhaps something like this:
TagKey <- c("PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_0", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1", "PDPLM040J0_ALIFOR01_Belt_1")
Year <- c(2018, 2019, 2020, 2019, 2020, 2021) # NOTE: edited so each plant has a unique year per row
TagEstablished <- c(2020, 2020, 2020, 2020, 2020, 2020)
StageClass <- c(1, 2, 3, 0, 3, 3)
ALFO_stages <- data.frame(TagKey, Year, TagEstablished, StageClass)
library(tidyr)
library(dplyr)
ALFO_stages %>% pivot_wider(id_cols = c(TagKey, TagEstablished), names_from = Year, values_from = StageClass)
This produces:
  TagKey                     TagEstablished `2018` `2019` `2020` `2021`
  <chr>                               <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 PDPLM040J0_ALIFOR01_Belt_0           2020      1      2      3     NA
2 PDPLM040J0_ALIFOR01_Belt_1           2020     NA      0      3      3
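The year columns above have non-syntactic names, so they need backticks whenever they are referenced. If column names like the Status2020 and Status2021 from the question are preferred, one option is the names_prefix argument of pivot_wider. This is a sketch: the "Status" prefix and the ALFO_status name are taken from the question, and names_sort needs a reasonably recent tidyr.

```r
library(tidyr)

# Same example data as in the answer, built in one call
ALFO_stages <- data.frame(
  TagKey = rep(c("PDPLM040J0_ALIFOR01_Belt_0",
                 "PDPLM040J0_ALIFOR01_Belt_1"), each = 3),
  Year = c(2018, 2019, 2020, 2019, 2020, 2021),
  TagEstablished = 2020,
  StageClass = c(1, 2, 3, 0, 3, 3)
)

# One row per plant; each year becomes a column named Status<year>
ALFO_status <- pivot_wider(
  ALFO_stages,
  id_cols = c(TagKey, TagEstablished),
  names_from = Year,
  names_prefix = "Status",
  names_sort = TRUE,   # order the new columns by year
  values_from = StageClass
)

ALFO_status
```

Years with no record for a plant are still filled with NA, as in the output above.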
Alternatively, you can do this manually with a lot of ifelse statements:
ALFO_stages %>%
  group_by(TagKey, TagEstablished) %>%
  summarise(y2018 = max(ifelse(Year == 2018, StageClass, NA), na.rm = TRUE),
            y2019 = max(ifelse(Year == 2019, StageClass, NA), na.rm = TRUE),
            y2020 = max(ifelse(Year == 2020, StageClass, NA), na.rm = TRUE),
            y2021 = max(ifelse(Year == 2021, StageClass, NA), na.rm = TRUE))
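Both pieces of code produce the same answer, apart from the column names and the handling of missing values: pivot_wider fills a plant/year combination that has no record with NA, whereas max(..., na.rm = TRUE) returns -Inf (with a warning) when every value it receives is NA.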
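If the manual approach is preferred but NA is wanted where a plant has no record for a year, one possibility is a small helper that checks for matching rows before calling max. This is only a sketch: stage_in_year is a hypothetical name introduced here, not a dplyr function.

```r
library(dplyr)

ALFO_stages <- data.frame(
  TagKey = rep(c("PDPLM040J0_ALIFOR01_Belt_0",
                 "PDPLM040J0_ALIFOR01_Belt_1"), each = 3),
  Year = c(2018, 2019, 2020, 2019, 2020, 2021),
  TagEstablished = 2020,
  StageClass = c(1, 2, 3, 0, 3, 3)
)

# Hypothetical helper: the stage recorded in `target` year,
# or NA when the plant has no row for that year
stage_in_year <- function(stage, year, target) {
  matched <- stage[year == target]
  if (length(matched) == 0) NA_real_ else max(matched)
}

manual_status <- ALFO_stages %>%
  group_by(TagKey, TagEstablished) %>%
  summarise(
    y2018 = stage_in_year(StageClass, Year, 2018),
    y2019 = stage_in_year(StageClass, Year, 2019),
    y2020 = stage_in_year(StageClass, Year, 2020),
    y2021 = stage_in_year(StageClass, Year, 2021),
    .groups = "drop"
  )

manual_status
```

This keeps one line per year, so pivot_wider remains the shorter option once there are more than a few years.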