Cumulative percentage per column of a data frame
I have a data.frame of daily temperatures for each Julian day at several stations.
Minimal reproducible example data.frame:
# No seed is set, so the simulated values (and the resulting days) differ between runs.
TemperatureData <- data.frame(
  Julian_Day = 1:365,
  Station_1 = rnorm(365, mean = 10, sd = 2),
  Station_2 = rnorm(365, mean = 10, sd = 2),
  Station_3 = rnorm(365, mean = 10, sd = 2)
)
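I want to find, for each station, the Julian day on which the cumulative value exceeds a given percentage of that station's total, and produce output listing that day for every station.
For example, suppose the total for Station 1 is 4000 and the cumulative value passes the chosen 50% threshold after 180 Julian days; the same calculation is repeated for every column of the data.frame (preferred output shown below).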
Station_1 Station_2 Station_3
180 183 179
I assume this would use the cumsum function in some way, but I'm not sure how to implement it. Can anyone help?
Let me know if this doesn't make sense.
Here is a tidyverse approach. I suspect there is a simpler way; if I figure it out, I'll post it.
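library(dplyr)
library(tidyr)
TemperatureData %>%
  # one row per station and day
  pivot_longer(cols = -Julian_Day, names_to = "Station") %>%
  group_by(Station) %>%
  arrange(Station, Julian_Day) %>%
  # running share of each station's total
  mutate(cumpct = cumsum(value) / sum(value)) %>%
  filter(cumpct >= 0.5) %>%
  # first day at or above the threshold, per station
  slice(1) %>%
  ungroup() %>%
  # keep only the columns needed so the result is a single row
  select(Station, Julian_Day) %>%
  pivot_wider(names_from = Station, values_from = Julian_Day)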
# A tibble: 1 x 3
  Station_1 Station_2 Station_3
      <int>     <int>     <int>
1       184       181       181
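To see what the cumpct column holds, it can help to run the same expression on a tiny vector that is easy to check by hand. This is only an illustrative sketch; the vector x and the name cumpct_demo are made up for the example and are not part of the data above.

```r
# Four made-up daily values with a total of 20
x <- c(2, 3, 5, 10)

# Running total divided by the overall total: the cumulative share
cumpct_demo <- cumsum(x) / sum(x)
cumpct_demo
#> [1] 0.10 0.25 0.50 1.00

# Position of the first element whose cumulative share reaches 50%
first_idx <- which(cumpct_demo >= 0.5)[1]
first_idx
#> [1] 3
```

The filter() plus slice(1) steps in the pipeline do the same thing as which(...)[1] here, once per station.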
Base R solution:
TemperatureData <- data.frame(
  Julian_Day = 1:365,
  Station_1 = rnorm(365, mean = 10, sd = 2),
  Station_2 = rnorm(365, mean = 10, sd = 2),
  Station_3 = rnorm(365, mean = 10, sd = 2)
)
# Replace each station column with its cumulative share of the column total
TemperatureData$Station_1 <- cumsum(TemperatureData$Station_1) / sum(TemperatureData$Station_1)
TemperatureData$Station_2 <- cumsum(TemperatureData$Station_2) / sum(TemperatureData$Station_2)
TemperatureData$Station_3 <- cumsum(TemperatureData$Station_3) / sum(TemperatureData$Station_3)
# First Julian day on which each cumulative share reaches 50%
results <- c(
  "Station 1" = TemperatureData$Julian_Day[TemperatureData$Station_1 >= .5][1],
  "Station 2" = TemperatureData$Julian_Day[TemperatureData$Station_2 >= .5][1],
  "Station 3" = TemperatureData$Julian_Day[TemperatureData$Station_3 >= .5][1]
)
results
#> Station 1 Station 2 Station 3
#> 180 185 183
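If there are many stations, the same idea could be wrapped in a small function and applied to every station column, which also avoids overwriting the original values. The following is a rough sketch rather than part of the answer above: first_day_reaching and the toy data frame are hypothetical names introduced here, and the toy values are chosen so the result can be verified by hand.

```r
# Hypothetical helper: first day on which the running total of x
# reaches fraction p of the overall total of x.
first_day_reaching <- function(x, day, p = 0.5) {
  day[which(cumsum(x) / sum(x) >= p)[1]]
}

# Small deterministic example (made-up values, each column sums to 10)
toy <- data.frame(
  Julian_Day = 1:5,
  Station_1 = c(1, 1, 1, 1, 6),  # cumulative share: 0.1 0.2 0.3 0.4 1.0
  Station_2 = c(5, 1, 1, 1, 2)   # cumulative share: 0.5 0.6 0.7 0.8 1.0
)

# Apply the helper to every column whose name starts with "Station"
station_cols <- grep("^Station", names(toy), value = TRUE)
toy_results <- sapply(toy[station_cols], first_day_reaching, day = toy$Julian_Day)
toy_results
#> Station_1 Station_2
#>         5         1
```

Passing p = 0.9, for example, would give the day on which 90% of the total is reached instead.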
tidyverse solution:
library(dplyr)
TemperatureData %>%
  summarize(across(matches("Station"),
                   function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]))
data.table solution:
library(data.table)
setDT(TemperatureData)
TemperatureData[, lapply(.SD, function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]),
                .SDcols = patterns("Station")]
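One detail worth noting: the answers in this thread differ in whether they test with >= .5 or > .5. With continuous random data a cumulative share of exactly 0.5 is very unlikely, so the two usually agree, but they can differ on round numbers. A small sketch with made-up values (the names x, share, day_at_or_above and day_strictly_above are chosen only for this illustration):

```r
# Made-up values where the cumulative share is exactly 0.5 on day 1
x <- c(5, 1, 1, 1, 2)
share <- cumsum(x) / sum(x)   # 0.5 0.6 0.7 0.8 1.0

# "reaches 50%" counts day 1
day_at_or_above <- which(share >= 0.5)[1]
day_at_or_above
#> [1] 1

# "exceeds 50%" only counts from day 2
day_strictly_above <- which(share > 0.5)[1]
day_strictly_above
#> [1] 2
```

Which comparison is appropriate depends on whether the threshold is meant as "reaches" or "exceeds".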