Cumulative percentage per column of a dataframe

I have a data.frame consisting of daily temperatures for each Julian day at multiple stations.

Minimal reproducible example data.frame:

TemperatureData <- data.frame(
    Julian_Day = 1:365,
    Station_1 = rnorm(365, mean = 10, sd = 2),
    Station_2 = rnorm(365, mean = 10, sd = 2),
    Station_3 = rnorm(365, mean = 10, sd = 2)
)

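For each station, I want to find the Julian day on which the cumulative value first exceeds a specified percentage of that station's total, and produce an output listing that Julian day for every station.

For example, say Station 1 has a total value of 4000 and, after Julian day 180, its cumulative value exceeds the chosen 50% threshold of that total. The same calculation would be repeated for each column of the data.frame (example of the preferred output below).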

Station_1   Station_2   Station_3
180         183         179

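I assume this would make use of the cumsum function in some way, but I'm not sure how to implement it. Can anyone help?

Please let me know if this doesn't make sense.

Here is a tidyverse approach. I suspect there is a simpler way; if I figure it out, I'll post it.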

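library(dplyr)
library(tidyr)
TemperatureData %>% 
  # one row per station and day
  pivot_longer(cols = -Julian_Day, names_to = "Station") %>%
  group_by(Station) %>%
  arrange(Station, Julian_Day) %>%
  # running share of each station's total
  mutate(cumpct = cumsum(value) / sum(value)) %>%
  filter(cumpct >= 0.5) %>%
  # first day at or above 50% for each station
  slice(1) %>%
  ungroup() %>%
  # keep only the columns to spread, so no id_cols argument is needed
  select(Station, Julian_Day) %>%
  pivot_wider(names_from = Station, values_from = Julian_Day)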

# A tibble: 1 x 3
  Station_1 Station_2 Station_3
      <int>     <int>     <int>
1       184       181       181
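To make the cumpct step concrete, here is a toy illustration in base R. The vector `toy` and its values are made up for this example and are not from the question; it only sketches how the running share of the total leads to the "first day at or above 50%".

```r
# Toy data (hypothetical values, not from the question)
toy <- c(1, 2, 3, 4)

# Running share of the total: cumulative sum divided by the grand total
cumpct <- cumsum(toy) / sum(toy)
cumpct
#> [1] 0.1 0.3 0.6 1.0

# Position of the first element at or above the 50% threshold
first_day <- which(cumpct >= 0.5)[1]
first_day
#> [1] 3
```

The total is 10, so the threshold is reached on the third "day", where the cumulative sum is 6.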

Base R solution:

TemperatureData <- data.frame(
    Julian_Day = 1:365,
    Station_1 = rnorm(365, mean = 10, sd = 2),
    Station_2 = rnorm(365, mean = 10, sd = 2),
    Station_3 = rnorm(365, mean = 10, sd = 2)
)

TemperatureData$Station_1 <- cumsum(TemperatureData$Station_1) / sum(TemperatureData$Station_1)
TemperatureData$Station_2 <- cumsum(TemperatureData$Station_2) / sum(TemperatureData$Station_2)
TemperatureData$Station_3 <- cumsum(TemperatureData$Station_3) / sum(TemperatureData$Station_3)


results <- c(
  "Station 1" = TemperatureData$Julian_Day[TemperatureData$Station_1 >= .5][1],
  "Station 2" = TemperatureData$Julian_Day[TemperatureData$Station_2 >= .5][1],
  "Station 3" = TemperatureData$Julian_Day[TemperatureData$Station_3 >= .5][1]
)
results
#> Station 1 Station 2 Station 3 
#>       180       185       183
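If there are many stations, writing out one line per column gets tedious. Below is a rough sketch of the same idea looped over every station column. The helper `first_day_at`, the data frame name `temps`, and the seed are my own placeholders rather than anything from the question, and it assumes the columns have no missing values.

```r
# Hypothetical helper: first day on which the running share of the
# total reaches `threshold` (NA if it never does).
first_day_at <- function(x, days, threshold = 0.5) {
  days[which(cumsum(x) / sum(x) >= threshold)[1]]
}

set.seed(42)  # arbitrary seed, only so the result is repeatable
temps <- data.frame(
  Julian_Day = 1:365,
  Station_1 = rnorm(365, mean = 10, sd = 2),
  Station_2 = rnorm(365, mean = 10, sd = 2),
  Station_3 = rnorm(365, mean = 10, sd = 2)
)

# Pick every column whose name starts with "Station"
station_cols <- grep("^Station", names(temps), value = TRUE)

# Apply the helper to each station column; returns a named vector
results <- sapply(temps[station_cols], first_day_at, days = temps$Julian_Day)
results
```

Because the helper works on the raw values, the original columns are left untouched, and a different threshold can be passed through sapply, for example `threshold = 0.9`.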

tidyverse solution:

library(dplyr)
TemperatureData %>%
  summarize(across(matches("Station"), 
                   function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]))

data.table solution:

library(data.table)

setDT(TemperatureData)

TemperatureData[, lapply(.SD, function(x) Julian_Day[cumsum(x) / sum(x) > .5][1]), 
                .SDcols=patterns("Station")]
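One small detail when comparing the answers in this thread: some use `>= .5` and others use `> .5`. With random continuous data the two should almost never differ, but they can when the cumulative share lands exactly on the threshold. A tiny made-up example (the vector `x` is hypothetical) shows the difference:

```r
x <- c(1, 1, 1, 1)            # hypothetical values
share <- cumsum(x) / sum(x)   # 0.25 0.50 0.75 1.00

# First position that reaches the threshold
at_or_above <- which(share >= 0.5)[1]
at_or_above
#> [1] 2

# First position that strictly exceeds the threshold
strictly_above <- which(share > 0.5)[1]
strictly_above
#> [1] 3
```

Which comparison is right depends on whether "exceeds" in the question is meant strictly.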