How to merge/copy different rows into one conditionally [R]
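I have a large data frame with names and a "classifying" variable called sequence. sequence says what position the other row has. It has two values: first and additional.
The problem is that these values are not evenly distributed, i.e. there is not an additional for every first, and each letters value is unique.
The data frame looks as follows (simplified):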
letters <- sample(LETTERS, 20)
sequence <- c("first","additional","first","first","first","first","first","additional","additional","additional","first","first","additional","first","additional","additional","first","additional","first","first")
df <- data.frame(sequence, letters)
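Now, what I would like to do is paste every additional value in letters into the corresponding first value in letters.
So, for example, the second (row) value in the letters column would be pasted into the first, because it is the corresponding additional. Further, the eighth, ninth, and tenth values of letters should be pasted inside (next to) the seventh value of letters (e.g. first; additional; additional; additional).
I have tried the following, but it has a clear limitation: lag() only looks at the immediately preceding row, so a run of several additional rows is not handled: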
library(dplyr)
df <- df %>% mutate(letters_ok = if_else(sequence == "additional",
paste(letters, lag(letters), sep = "; "), letters))
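To stress my question: how can I lag conditionally on the values in sequence, so that I can paste the values in letters according to the first/additional classification?
Because every letters value is unique and tied to a specific sequence value, I did not use group_by. Every other solution is beyond my current knowledge of string/character wrangling, so I would greatly appreciate any help.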
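Here is a data.table approach. I changed your sample data slightly, because letters is not a very convenient column name (it is also the name of a built-in R constant). I also added set.seed(123) for reproducibility.
Sample data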
set.seed(123)
letter <- sample(LETTERS, 20)
sequence <- c("first","additional","first","first","first","first","first","additional","additional","additional","first","first","additional","first","additional","additional","first","additional","first","first")
df <- data.frame(sequence, letter)
# sequence letter
# 1 first O
# 2 additional S
# 3 first N
# 4 first C
# 5 first J
# 6 first R
# 7 first K
# 8 additional E
# 9 additional X
# 10 additional Y
# 11 first W
# 12 first T
# 13 additional I
# 14 first L
# 15 additional U
# 16 additional M
# 17 first P
# 18 additional H
# 19 first B
# 20 first G
Code
library( data.table )
#convert to data.table format
setDT(df)
#add id-column
df[, id := .I ]
#perform rolling join
temp <- df[ sequence == "first", ][ df[ sequence == "additional", ],
.( x.letter, i.letter, i.id, x.id),
on = .(id),
roll = Inf ]
#summarise
temp <- temp[, paste0( `i.letter`, collapse = ";" ), by = .(x.id) ]
#join, drop id column
# the trailing [] makes the result print, since := returns invisibly
df[sequence == "first", ][ temp, letter := paste( letter, i.V1, sep = ";"), on = .(id = `x.id`) ][, id := NULL][]
Output
# sequence letter
# 1: first O;S
# 2: first N
# 3: first C
# 4: first J
# 5: first R
# 6: first K;E;X;Y
# 7: first W
# 8: first T;I
# 9: first L;U;M
#10: first P;H
#11: first B
#12: first G
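For comparison, the same grouping idea can be sketched without a join. This is not the rolling join used above but a swapped-in technique: a running count of the first rows serves as a group id, so each additional row lands in the group of the closest preceding first. The sketch is base R only and assumes the data always starts with a first row; the names grp and merged are made up for illustration, and the letters are typed out so the example does not depend on the random number generator.

```r
# Sample data as printed in the answer above.
df <- data.frame(
  sequence = c("first", "additional", "first", "first", "first", "first",
               "first", "additional", "additional", "additional", "first",
               "first", "additional", "first", "additional", "additional",
               "first", "additional", "first", "first"),
  letter = c("O", "S", "N", "C", "J", "R", "K", "E", "X", "Y",
             "W", "T", "I", "L", "U", "M", "P", "H", "B", "G"),
  stringsAsFactors = FALSE
)

# Each "first" row increments the counter, so it opens a new group;
# "additional" rows keep the counter of the preceding "first".
# Assumption: row 1 is a "first" (otherwise leading rows fall in group 0).
grp <- cumsum(df$sequence == "first")

# Collapse the letters of every group into a single ";"-separated string.
merged <- data.frame(
  sequence = "first",
  letter = unname(vapply(split(df$letter, grp), paste, character(1),
                         collapse = ";")),
  stringsAsFactors = FALSE
)

print(merged)
```

On this sample data the letter column of merged should match the output shown above. The same grp vector could also be used as a by = argument in data.table or inside group_by() in dplyr if you prefer to stay in one of those frameworks.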