lead or lag function to get several values, not just the nth
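I have a tibble with a list of words, one per row. I want to create a new variable from a function that searches for a keyword and, if it finds the keyword, creates a string made up of the keyword plus and minus 3 words.
The code below is close, but rather than grabbing all three words before and after the keyword, it grabs only the single word 3 ahead/behind.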
library(dplyr)

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))
df <- df %>% mutate(chunks = ifelse(words == "times",
                                    paste(lag(words, 3),
                                          words,
                                          lead(words, 3), sep = " "),
                                    NA))
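The most intuitive solution would be if the lag function could do something like this: lead(words, 1:3), but that doesn't work.
Obviously I could do this by hand fairly quickly (paste(lead(words,3), lead(words,2), lead(words,1),...lag(words,3))), but eventually I want to be able to grab the keyword plus and minus 50 words, and that is too much to hand-code.
A solution in the tidyverse would be ideal, but any solution would be helpful. Any help would be appreciated.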
One option would be sapply:
library(dplyr)

df %>%
  mutate(
    chunks = ifelse(
      words == "times",
      sapply(
        1:nrow(.),
        function(x) paste(words[pmax(1, x - 3):pmin(x + 3, nrow(.))], collapse = " ")
      ),
      NA
    )
  )
Output:
# A tibble: 12 x 2
words chunks
<chr> <chr>
1 it NA
2 was NA
3 the NA
4 best NA
5 of NA
6 times the best of times it was the
7 it NA
8 was NA
9 the NA
10 worst NA
11 of NA
12 times the worst of times
While it is not explicitly a lead or lag function, it can often serve the same purpose.
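To widen the window (for example to 50 words on each side), the same index-clamping idea can be wrapped in a small helper. This is only a minimal sketch in base R: `context_window` and its arguments are hypothetical names, not part of the answer above or of any package.

```r
# Hypothetical helper: for every position where `words` equals `keyword`,
# paste together the words within `n` positions on either side.
# Positions that do not match the keyword get NA.
context_window <- function(words, keyword, n = 3L) {
  len <- length(words)
  out <- rep(NA_character_, len)
  for (i in which(words == keyword)) {
    idx <- max(1L, i - n):min(len, i + n)  # clamp the window to the vector
    out[i] <- paste(words[idx], collapse = " ")
  }
  out
}

words <- c("it", "was", "the", "best", "of", "times",
           "it", "was", "the", "worst", "of", "times")
chunks <- context_window(words, "times", n = 3L)
chunks[6]   # "the best of times it was the"
chunks[12]  # "the worst of times"
```

Because the window size is an ordinary argument, `context_window(words, "times", n = 50L)` should work the same way without any hand-coding.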
Similar to @arg0naut's answer, but without dplyr:
r = 1:nrow(df)
w = which(df$words == "times")
wm = lapply(w, function(wi) intersect(r, seq(wi-3L, wi+3L)))
df$chunks <- NA_character_
df$chunks[w] <- tapply(df$words[unlist(wm)], rep(w, lengths(wm)), FUN = paste, collapse=" ")
# A tibble: 12 x 2
words chunks
<chr> <chr>
1 it <NA>
2 was <NA>
3 the <NA>
4 best <NA>
5 of <NA>
6 times the best of times it was the
7 it <NA>
8 was <NA>
9 the <NA>
10 worst <NA>
11 of <NA>
12 times the worst of times
A data.table translation:
library(data.table)
DT = data.table(df)
w = DT["times", on="words", which=TRUE]
wm = lapply(w, function(wi) intersect(r, seq(wi-3L, wi+3L)))
DT[w, chunks := DT[unlist(wm), paste(words, collapse=" "), by=rep(w, lengths(wm))]$V1]
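data.table::shift accepts a vector for the n (lag) argument and outputs a list, so you can use that and do.call(paste) the list elements together. However, unless you are on data.table version >= 1.12, I don't think it will let you mix negative and positive n values (as below).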
With data.table:
library(data.table)
setDT(df)
df[, chunks := trimws(ifelse(words != "times", NA, do.call(paste, shift(words, 3:-3, ''))))]
# words chunks
# 1: it <NA>
# 2: was <NA>
# 3: the <NA>
# 4: best <NA>
# 5: of <NA>
# 6: times the best of times it was the
# 7: it <NA>
# 8: was <NA>
# 9: the <NA>
# 10: worst <NA>
# 11: of <NA>
# 12: times the worst of times
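To make the list that shift returns easier to picture, here is a small sketch on a three-element vector. It assumes data.table is installed and, as the answer notes, a version that allows mixing positive and negative n; the vector `x` is made up for illustration.

```r
library(data.table)

x <- c("a", "b", "c")

# n = 1:-1 asks for lag 1, lag 0 (the vector itself) and lead 1.
# With the default type = "lag", a negative n acts as a lead.
shifted <- shift(x, n = 1:-1, fill = "")
length(shifted)  # one list element per value of n

# Paste the list elements together position by position,
# then trim the padding that fill = "" leaves at the edges.
pasted <- trimws(do.call(paste, shifted))
pasted
```

Each element of `pasted` is the corresponding element of `x` together with its neighbours one step back and one step ahead, which is the same mechanism the 3:-3 call relies on.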
With dplyr, using data.table only for the shift function:
library(dplyr)
df %>%
mutate(chunks = do.call(paste, data.table::shift(words, 3:-3, fill = '')),
chunks = trimws(ifelse(words != "times", NA, chunks)))
# # A tibble: 12 x 2
# words chunks
# <chr> <chr>
# 1 it NA
# 2 was NA
# 3 the NA
# 4 best NA
# 5 of NA
# 6 times the best of times it was the
# 7 it NA
# 8 was NA
# 9 the NA
# 10 worst NA
# 11 of NA
# 12 times the worst of times
Here is another tidyverse solution using lag and lead:
library(dplyr)
library(tidyr)  # unite() comes from tidyr

# Build named strings such as "lag (.,  3 , default = '')" for funs_()
laglead_f <- function(what, range)
  setNames(paste(what, "(., ", range, ", default = '')"), paste(what, range))

df %>%
  mutate_at(vars(words), funs_(c(laglead_f("lag", 3:0), laglead_f("lead", 1:3)))) %>%
  unite(chunks, -words, sep = " ") %>%
  mutate(chunks = ifelse(words == "times", trimws(chunks), NA))
## A tibble: 12 x 2
# words chunks
# <chr> <chr>
# 1 it NA
# 2 was NA
# 3 the NA
# 4 best NA
# 5 of NA
# 6 times the best of times it was the
# 7 it NA
# 8 was NA
# 9 the NA
#10 worst NA
#11 of NA
#12 times the worst of times
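The idea is to store the values from the three lagged and three leading vectors in new columns with mutate_at and named functions, unite these columns, and then filter the entries according to your condition words == "times".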
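The same lag/lead idea can also be parameterised by window size without building function strings. The following is a sketch under assumptions, not the answerer's code: `window_chunks` is a hypothetical helper, and it relies only on dplyr::lag and dplyr::lead with their `default` argument.

```r
library(dplyr)

# Hypothetical helper: paste each element of `x` together with its `n`
# predecessors and `n` successors, padding with "" at the edges.
window_chunks <- function(x, n) {
  pieces <- lapply(n:-n, function(k) {
    if (k > 0) {
      dplyr::lag(x, k, default = "")    # k positions behind
    } else if (k < 0) {
      dplyr::lead(x, -k, default = "")  # -k positions ahead
    } else {
      x                                 # the element itself
    }
  })
  # Paste position by position, then trim the edge padding
  trimws(do.call(paste, pieces))
}

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

res <- df %>%
  mutate(chunks = ifelse(words == "times",
                         window_chunks(words, 3),
                         NA_character_))
res$chunks[6]   # "the best of times it was the"
res$chunks[12]  # "the worst of times"
```

Changing the second argument to 50 widens the window to the keyword plus and minus 50 words.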