R: How to generate a column whose row values depend on the values of the nearest N rows
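I am looking for a way to encode information from a column in the previous N rows into a given row. The dataset is sorted.
In short, I want to create a column called oneweeksince that is TRUE for the seven rows following a row where the victims column is greater than 0 (or not NA).
In other words, for row[i], if row[j]$victims > 0 for any row j from row[i - 7] to row[i], then row[i]$oneweeksince should be TRUE. On rows where victims > 0 or !is.na(victims), oneweeksince should also be TRUE.
How can I create this column/feature automatically? The date column could also be used to compute the distance in days. I am trying to avoid writing a loop because loops perform slowly in R.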
The dataset should look like this:
date oneweeksince victims
1 2009-01-01 FALSE NA
2 2009-01-02 FALSE NA
3 2009-01-03 FALSE NA
4 2009-01-04 FALSE NA
5 2009-01-05 FALSE NA
6 2009-01-06 FALSE NA
7 2009-01-07 FALSE NA
8 2009-01-08 TRUE 1
9 2009-01-09 TRUE NA
10 2009-01-10 TRUE NA
11 2009-01-11 TRUE NA
12 2009-01-12 TRUE NA
13 2009-01-13 TRUE NA
14 2009-01-14 TRUE NA
15 2009-01-15 TRUE NA
16 2009-01-16 FALSE NA
17 2009-01-17 FALSE NA
18 2009-01-18 FALSE NA
19 2009-01-19 FALSE NA
20 2009-01-20 FALSE NA
The dataset spans many years, so I need an efficient way to do this.
We can compute a rolling sum and test whether it is greater than 0:
library(RcppRoll)
your_data$result = roll_sum(
  x = your_data$victims,
  n = 8,
  na.rm = TRUE,
  fill = 0,
  align = "right"
) > 0
your_data
# date oneweeksince victims result
# 1 2009-01-01 FALSE NA FALSE
# 2 2009-01-02 FALSE NA FALSE
# 3 2009-01-03 FALSE NA FALSE
# 4 2009-01-04 FALSE NA FALSE
# 5 2009-01-05 FALSE NA FALSE
# 6 2009-01-06 FALSE NA FALSE
# 7 2009-01-07 FALSE NA FALSE
# 8 2009-01-08 TRUE 1 TRUE
# 9 2009-01-09 TRUE NA TRUE
# 10 2009-01-10 TRUE NA TRUE
# 11 2009-01-11 TRUE NA TRUE
# 12 2009-01-12 TRUE NA TRUE
# 13 2009-01-13 TRUE NA TRUE
# 14 2009-01-14 TRUE NA TRUE
# 15 2009-01-15 TRUE NA TRUE
# 16 2009-01-16 FALSE NA FALSE
# 17 2009-01-17 FALSE NA FALSE
# 18 2009-01-18 FALSE NA FALSE
# 19 2009-01-19 FALSE NA FALSE
# 20 2009-01-20 FALSE NA FALSE
Using this data:
your_data = read.table(header = T, text = ' date oneweeksince victims
1 2009-01-01 FALSE NA
2 2009-01-02 FALSE NA
3 2009-01-03 FALSE NA
4 2009-01-04 FALSE NA
5 2009-01-05 FALSE NA
6 2009-01-06 FALSE NA
7 2009-01-07 FALSE NA
8 2009-01-08 TRUE 1
9 2009-01-09 TRUE NA
10 2009-01-10 TRUE NA
11 2009-01-11 TRUE NA
12 2009-01-12 TRUE NA
13 2009-01-13 TRUE NA
14 2009-01-14 TRUE NA
15 2009-01-15 TRUE NA
16 2009-01-16 FALSE NA
17 2009-01-17 FALSE NA
18 2009-01-18 FALSE NA
19 2009-01-19 FALSE NA
20 2009-01-20 FALSE NA')
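Since the question stresses efficiency over many years of data, here is a minimal base R sketch of the same rolling-count idea using cumsum, with no extra packages. The helper name flag_recent and its lookback argument are hypothetical, not part of the answer above. It treats NA as "no event" and uses shortened windows for the first rows, so an event in the first 7 rows is still picked up.

```r
# Hypothetical helper: TRUE when the current row or any of the `lookback`
# rows before it has victims > 0. NA is treated as "no event".
flag_recent <- function(victims, lookback = 7) {
  hit <- !is.na(victims) & victims > 0
  cs  <- cumsum(hit)                       # events seen up to and including each row
  n   <- length(hit)
  # events seen before each window starts (0 while the window is clipped at row 1)
  before <- c(rep(0, min(lookback + 1, n)), head(cs, max(n - lookback - 1, 0)))
  (cs - before) > 0
}

victims <- c(rep(NA, 7), 1, rep(NA, 12))   # same shape as the sample data
res <- flag_recent(victims)
which(res)
# 8 9 10 11 12 13 14 15
```

The difference of two cumulative sums gives the number of events inside each window, so the whole computation is a handful of vectorized operations regardless of the window width.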
I prefer Gregor's answer, but here are two alternatives.
Base R
x$y <- Sys.Date()[NA] # just a class-stable way
x$y[ !is.na(x$victims) ] <- x$date[ !is.na(x$victims) ]
x$since <- difftime(x$date, zoo::na.locf(x$y, na.rm = FALSE), units="days")
x$oneweeksince <- !is.na(x$since) & (0 <= x$since & x$since <= 7)
dplyr
library(dplyr)
x %>%
  mutate(
    y = zoo::na.locf(if_else(is.na(victims), date[NA], date), na.rm = FALSE),
    since = difftime(date, y, units = "days"),
    anotherweeksince = !is.na(since) & between(since, 0, 7)
  )
Data:
x <- read.table(stringsAsFactors=FALSE, header=TRUE, text="
date oneweeksince victims
1 2009-01-01 FALSE NA
2 2009-01-02 FALSE NA
3 2009-01-03 FALSE NA
4 2009-01-04 FALSE NA
5 2009-01-05 FALSE NA
6 2009-01-06 FALSE NA
7 2009-01-07 FALSE NA
8 2009-01-08 TRUE 1
9 2009-01-09 TRUE NA
10 2009-01-10 TRUE NA
11 2009-01-11 TRUE NA
12 2009-01-12 TRUE NA
13 2009-01-13 TRUE NA
14 2009-01-14 TRUE NA
15 2009-01-15 TRUE NA
16 2009-01-16 FALSE NA
17 2009-01-17 FALSE NA
18 2009-01-18 FALSE NA
19 2009-01-19 FALSE NA
20 2009-01-20 FALSE NA")
x$date <- as.Date(x$date)
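For comparison, here is a rough sketch of the same "days since the last event" idea without zoo, using findInterval from base R. It assumes the dates are sorted ascending and that an event means victims > 0; the variable names event_dates, last_event and days_since are only illustrative. Because it works on date distance rather than row counts, it should also cope with missing days in the series.

```r
x <- data.frame(
  date    = as.Date("2009-01-01") + 0:19,
  victims = c(rep(NA, 7), 1, rep(NA, 12))
)

# Dates on which an event occurred (assumes x is sorted by date)
event_dates <- x$date[!is.na(x$victims) & x$victims > 0]

# For each row, index of the latest event on or before that date (0 if none yet)
idx <- findInterval(as.numeric(x$date), as.numeric(event_dates))

last_event <- rep(as.Date(NA), nrow(x))
last_event[idx > 0] <- event_dates[idx[idx > 0]]

days_since <- as.numeric(x$date - last_event)
x$oneweeksince <- !is.na(days_since) & days_since <= 7

which(x$oneweeksince)
# 8 9 10 11 12 13 14 15
```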
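I am not sure about efficiency, but one way to do this in base R is with sapply: for each row we look back 7 rows, check whether any of them satisfies the condition, and return the corresponding boolean. Here df is the data frame from the question.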
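sapply(seq_len(nrow(df)), function(x) {
  # current row plus up to 7 rows before it, clipped at row 1
  temp = df$victims[x : max(1, x - 7)]
  # na.rm = TRUE so a window holding only NAs and zeros gives FALSE, not NA
  any(temp > 0, na.rm = TRUE)
})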
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
# TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
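One detail worth checking in any window-based test is how any() treats NA. The short sketch below is illustrative only (the toy vector and the lookback of 1 are assumptions): without na.rm = TRUE, a window that mixes NA with non-positive values yields NA rather than FALSE. It also uses vapply, which guarantees a logical result of the expected length.

```r
window <- c(0, NA)
any(window > 0)                  # NA: the NA element could still be TRUE
any(window > 0, na.rm = TRUE)    # FALSE

# Toy data: look back 1 row instead of 7 to keep the example small
victims  <- c(0, NA, NA, 2, NA)
lookback <- 1
flags <- vapply(seq_along(victims), function(i) {
  w <- victims[max(1, i - lookback):i]
  any(w > 0, na.rm = TRUE)
}, logical(1))
flags
# FALSE FALSE FALSE  TRUE  TRUE
```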
Solution from @G.Grothendieck
After some discussion, this is the most efficient answer.
library(dplyr)
library(zoo)
dat2 <- dat %>%
  mutate(roll = rollapplyr(victims > 0, 8, any, na.rm = TRUE, fill = NA, partial = TRUE)) %>%
  mutate(oneweeksince = roll > 0) %>%
  select(-roll)
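My previous attempt
A solution using rollapplyr from the zoo package. rollapplyr applies a function over a rolling window. In this case we can set the window width to 8 and apply the mean function. Note that rollmean is not suitable here because we cannot specify na.rm = TRUE in rollmean. The last step is simply to evaluate whether the roll column is greater than 0.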
library(dplyr)
library(zoo)
dat2 <- dat %>%
  mutate(roll = rollapplyr(victims, width = 8, FUN = function(x) mean(x, na.rm = TRUE), fill = NA)) %>%
  mutate(oneweeksince = roll > 0) %>%
  select(-roll)
# dat2
# date victims oneweeksince
# 1 2009-01-01 NA NA
# 2 2009-01-02 NA NA
# 3 2009-01-03 NA NA
# 4 2009-01-04 NA NA
# 5 2009-01-05 NA NA
# 6 2009-01-06 NA NA
# 7 2009-01-07 NA NA
# 8 2009-01-08 1 TRUE
# 9 2009-01-09 NA TRUE
# 10 2009-01-10 NA TRUE
# 11 2009-01-11 NA TRUE
# 12 2009-01-12 NA TRUE
# 13 2009-01-13 NA TRUE
# 14 2009-01-14 NA TRUE
# 15 2009-01-15 NA TRUE
# 16 2009-01-16 NA NA
# 17 2009-01-17 NA NA
# 18 2009-01-18 NA NA
# 19 2009-01-19 NA NA
# 20 2009-01-20 NA NA
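The NA values in the output above, where the question expects FALSE, follow from how mean behaves on an all-NA window. A small sketch in base R (the variable names are illustrative):

```r
all_na <- c(NA_real_, NA_real_)

m <- mean(all_na, na.rm = TRUE)   # nothing left to average
is.nan(m)                         # TRUE
m > 0                             # NA, not FALSE

# One possible way to turn such results into FALSE
roll <- c(NaN, 1, NA)
!is.na(roll) & roll > 0
# FALSE  TRUE FALSE
```

The first 7 rows are NA for a different reason: fill = NA pads the positions where a full window of 8 is not available.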
Data
dat <- read.table(text = " date oneweeksince victims
1 '2009-01-01' FALSE NA
2 '2009-01-02' FALSE NA
3 '2009-01-03' FALSE NA
4 '2009-01-04' FALSE NA
5 '2009-01-05' FALSE NA
6 '2009-01-06' FALSE NA
7 '2009-01-07' FALSE NA
8 '2009-01-08' TRUE 1
9 '2009-01-09' TRUE NA
10 '2009-01-10' TRUE NA
11 '2009-01-11' TRUE NA
12 '2009-01-12' TRUE NA
13 '2009-01-13' TRUE NA
14 '2009-01-14' TRUE NA
15 '2009-01-15' TRUE NA
16 '2009-01-16' FALSE NA
17 '2009-01-17' FALSE NA
18 '2009-01-18' FALSE NA
19 '2009-01-19' FALSE NA
20 '2009-01-20' FALSE NA",
header = TRUE, stringsAsFactors = FALSE)
dat$oneweeksince <- NULL
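My second attempt
The OP pointed out that my solution does not work if there are entries in the first N rows, where N is the window width. Here I provide a solution that addresses this. I use the same example data frame, except that I changed the second row of victims to 1. The new solution needs functions from purrr and tidyr, so I load the tidyverse package.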
library(tidyverse)
library(zoo)
dat2 <- dat %>%
  mutate(roll = rollapplyr(victims, width = 8, FUN = function(x) mean(x, na.rm = TRUE), fill = NA)) %>%
  # Split the data frame for the first width - 1 rows and others
  mutate(GroupID = ifelse(row_number() <= 7, 1L, 2L)) %>%
  split(.$GroupID) %>%
  # Check if the GroupID is 1. If yes, change the roll column to be the same as victims
  # After that, use fill to fill in NA
  map_if(function(x) unique(x$GroupID) == 1L,
         ~.x %>% mutate(roll = victims) %>% fill(roll)) %>%
  # Combine data frames
  bind_rows() %>%
  mutate(oneweeksince = roll > 0) %>%
  select(-roll)
# dat2
# date victims GroupID oneweeksince
# 1 2009-01-01 NA 1 NA
# 2 2009-01-02 1 1 TRUE
# 3 2009-01-03 NA 1 TRUE
# 4 2009-01-04 NA 1 TRUE
# 5 2009-01-05 NA 1 TRUE
# 6 2009-01-06 NA 1 TRUE
# 7 2009-01-07 NA 1 TRUE
# 8 2009-01-08 1 2 TRUE
# 9 2009-01-09 NA 2 TRUE
# 10 2009-01-10 NA 2 TRUE
# 11 2009-01-11 NA 2 TRUE
# 12 2009-01-12 NA 2 TRUE
# 13 2009-01-13 NA 2 TRUE
# 14 2009-01-14 NA 2 TRUE
# 15 2009-01-15 NA 2 TRUE
# 16 2009-01-16 NA 2 NA
# 17 2009-01-17 NA 2 NA
# 18 2009-01-18 NA 2 NA
# 19 2009-01-19 NA 2 NA
# 20 2009-01-20 NA 2 NA
Data
dat <- read.table(text = " date oneweeksince victims
1 '2009-01-01' FALSE NA
2 '2009-01-02' FALSE 1
3 '2009-01-03' FALSE NA
4 '2009-01-04' FALSE NA
5 '2009-01-05' FALSE NA
6 '2009-01-06' FALSE NA
7 '2009-01-07' FALSE NA
8 '2009-01-08' TRUE 1
9 '2009-01-09' TRUE NA
10 '2009-01-10' TRUE NA
11 '2009-01-11' TRUE NA
12 '2009-01-12' TRUE NA
13 '2009-01-13' TRUE NA
14 '2009-01-14' TRUE NA
15 '2009-01-15' TRUE NA
16 '2009-01-16' FALSE NA
17 '2009-01-17' FALSE NA
18 '2009-01-18' FALSE NA
19 '2009-01-19' FALSE NA
20 '2009-01-20' FALSE NA",
header = TRUE, stringsAsFactors = FALSE)
dat$oneweeksince <- NULL
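As an alternative to splitting the data frame, the first-rows problem can also be handled by padding the series with zeros before taking a rolling sum. The sketch below uses stats::filter from base R instead of zoo; it is a swapped-in technique, not the method above, and the names hit, padded and flags are illustrative.

```r
victims <- c(NA, 1, rep(NA, 5), 1, rep(NA, 12))   # event in row 2 and row 8
width   <- 8                                      # current row plus 7 before it

hit    <- as.numeric(!is.na(victims) & victims > 0)
padded <- c(rep(0, width - 1), hit)               # zeros stand in for rows before row 1

# One-sided moving sum over the last `width` values
roll  <- stats::filter(padded, rep(1, width), sides = 1)
flags <- as.vector(roll)[-seq_len(width - 1)] > 0.5

which(flags)
# 2 3 4 5 6 7 8 9 10 11 12 13 14 15
```

Padding keeps every window at full width, so no row needs special treatment and the result has no NA values.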