How to reduce redundant repeated patterns in a string, but retain those at a later position?

Question

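I am trying to collapse consecutive repeats of a pattern in a string while keeping the pattern when it reappears at a later, non-adjacent position. Unfortunately, I can't seem to do this with gsub without also removing the occurrences at those other positions.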

Basically I want to turn this:

"Place1-Place2-Place2-Place4-Place2-Place3-Place5"
"Place1-Place1-Place1-Place1-Place3-Place5"
"Place1-Place4-Place2-Place3-Place3-Place5-Place5"

into this:

"Place1-Place2-Place4-Place2-Place3-Place5"
"Place1-Place3-Place5"
"Place1-Place4-Place2-Place3-Place5"

Here is what I have so far:

library(stringr)
library(data.table)
library(dplyr)

df <- c("Place1-Place2-Place2-Place4-Place2-Place3-Place5",
        "Place1-Place1-Place1-Place1-Place3-Place5",
        "Place1-Place4-Place2-Place3-Place3-Place5-Place5")

df_split <- as.data.table(str_split(df[1],"-"))
df[1] <- df_split %>% summarise(location=paste(unique(V1),collapse="-"))
df[1]

Thanks in advance!

Answer 1

Here are some approaches. No packages are used. (See the Note at the end for the definition of s.)

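1) gsub Replace any run of non-minus characters that is followed by a minus and then by the same run with the empty string (the minus is removed as well). The lookahead (?=\\1) checks for the repeated run without consuming it, so longer runs of repeats collapse in a single pass:

gsub("([^-]+)-(?=\\1)", "", s, perl = TRUE)

giving: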

[1] "Place1-Place2-Place4-Place2-Place3-Place5"
[2] "Place1-Place3-Place5"                     
[3] "Place1-Place4-Place2-Place3-Place5"       
[4] "Place6"                                   
[5] "Place7"
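
One caveat with (1), offered as a sketch rather than as part of the original answer: the backreference only has to match the start of the next component, so an input such as "Place1-Place10" (a hypothetical value that does not occur in s) would lose its first component. Assuming components can share prefixes or suffixes, anchoring both ends of the match is one way to avoid that:

```r
# Hypothetical input: "Place1" is a prefix of its neighbor "Place10"
x <- c("Place1-Place10", "Place1-Place1-Place3", "Place6-Place6")

# Pattern from (1): the lookahead is satisfied by "Place1" inside "Place10"
loose <- gsub("([^-]+)-(?=\\1)", "", x, perl = TRUE)

# Anchored variant: the component must begin at the start of the string or
# right after "-", and the repeated copy must end at "-" or at the end
strict <- gsub("(?<![^-])([^-]+)-(?=\\1(?:-|$))", "", x, perl = TRUE)

print(loose)
print(strict)
```

For the strings in s both patterns give the same result, since no component there is a prefix or suffix of an adjacent one.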

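2) strsplit/paste Alternatively, split each string on the minus sign, drop every element that equals its predecessor, and paste the rest back together. This gives the same answer: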

undup_paste <- function(x) paste(x[c(TRUE, x[-1] != x[-length(x)])], collapse = "-")
sapply(strsplit(s, "-"), undup_paste)
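
To see what the index expression in undup_paste does, here is a small walk-through on a single string. It is only an illustration; the variable names changed and keep are made up for this example:

```r
x <- strsplit("Place1-Place2-Place2-Place4-Place2", "-")[[1]]

# Compare every element from the second one on with its predecessor
changed <- x[-1] != x[-length(x)]
print(changed)

# The first element is always kept; later ones only if they differ
# from the element directly before them
keep <- c(TRUE, changed)
print(x[keep])
```

Because only neighbors are compared, the second "Place2" is dropped while the one after "Place4" survives.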

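3) rle This uses strsplit and paste as in (2), but the x[...] part is replaced with rle, whose values component holds one entry per run of equal consecutive elements: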

sapply(strsplit(s, "-"), function(x) paste(rle(x)$values, collapse = "-"))
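
As a quick illustration of why rle fits here, this is what it returns for the components of the second string in s (the variable names x and r are just for this sketch):

```r
x <- c("Place1", "Place1", "Place1", "Place1", "Place3", "Place5")

r <- rle(x)
print(r$lengths)  # length of each run of equal consecutive elements
print(r$values)   # the value of each run, one entry per run

# Pasting the run values back together collapses the repeats
print(paste(r$values, collapse = "-"))
```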

Note: The input character vector s in reproducible form is:

s <- c("Place1-Place2-Place2-Place4-Place2-Place3-Place5",
       "Place1-Place1-Place1-Place1-Place3-Place5",
       "Place1-Place4-Place2-Place3-Place3-Place5-Place5",
       "Place6-Place6", 
       "Place7-Place7-Place7-Place7-Place7-Place7-Place7-Place7")

This is the same as df in the question, except that two components have been appended at the end, as suggested in the comments.
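
For anyone who wants to confirm that the three approaches agree on this input, a minimal self-contained check might look like this (r1, r2 and r3 are names chosen for this sketch):

```r
s <- c("Place1-Place2-Place2-Place4-Place2-Place3-Place5",
       "Place1-Place1-Place1-Place1-Place3-Place5",
       "Place1-Place4-Place2-Place3-Place3-Place5-Place5",
       "Place6-Place6",
       "Place7-Place7-Place7-Place7-Place7-Place7-Place7-Place7")

# (1) regex with a lookahead backreference
r1 <- gsub("([^-]+)-(?=\\1)", "", s, perl = TRUE)

# (2) split, drop elements equal to their predecessor, paste back
undup_paste <- function(x) paste(x[c(TRUE, x[-1] != x[-length(x)])], collapse = "-")
r2 <- sapply(strsplit(s, "-"), undup_paste)

# (3) split, keep one value per run, paste back
r3 <- sapply(strsplit(s, "-"), function(x) paste(rle(x)$values, collapse = "-"))

# Stops with an error if any pair of results differs
stopifnot(identical(r1, r2), identical(r2, r3))
print(r1)
```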
