Shortest way to remove duplicate words from a string

Question

I have this string:

x <- c("A B B C")

[1] "A B B C"

I am looking for the shortest way to get:

[1] "A B C"

I tried the approach from "Removing duplicate words in a string in R":

paste(unique(x), collapse = ' ')

[1] "A B B C"
# does not work

Background: in a data frame column, I only want to count the unique words.

Answer 1

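A regex-based approach may be shorter: match one or more non-whitespace characters (\S+) followed by a whitespace character (\s), capture them as a group, then match one or more repetitions of the backreference. In the replacement, use the backreference so that only a single copy of the match is returned.

gsub("(\\S+\\s)\\1+", "\\1", x)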
[1] "A B C"
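
A small sketch of how this pattern behaves on a few inputs may help. The extra test strings are made up for illustration, and perl = TRUE is added here only so the results do not depend on the default regex engine. Because the captured group must end in whitespace, a repeated word at the very end of the string is left alone, and because the pattern has no word boundary, the tail of a longer word can be treated as a duplicate.

```r
# Pattern from this answer, with backslashes escaped for an R string literal.
pat <- "(\\S+\\s)\\1+"

# perl = TRUE is an assumption made here to pin down the regex engine.
r1 <- gsub(pat, "\\1", "A B B C", perl = TRUE)  # adjacent duplicate in the middle
r2 <- gsub(pat, "\\1", "A B B", perl = TRUE)    # duplicate at the end of the string
r3 <- gsub(pat, "\\1", "AB B C", perl = TRUE)   # "B" is also the tail of "AB"

print(c(r1, r2, r3))
# r1 is "A B C": the repeated "B " is collapsed.
# r2 is "A B B": the last "B" has no trailing whitespace, so nothing matches.
# r3 is "AB C":  "B " inside "AB " counts as a repeat, which is usually unwanted.
```

If those edge cases matter for your data, a pattern anchored with word boundaries or the strsplit approach is the safer choice.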

Alternatively, split the string with strsplit, flatten the result with unlist, take the unique words, and then paste them back together:

paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
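
For the data frame use case mentioned in the question, the same idea can be applied per row. The following is only a sketch: the data frame `df` and the column names `text`, `dedup`, and `n_unique` are hypothetical, and it assumes words are separated by single spaces.

```r
# Hypothetical data frame; the column name `text` is an assumption.
df <- data.frame(text = c("A B B C", "D D D", "E F E"),
                 stringsAsFactors = FALSE)

# Split every string on single spaces, then deduplicate within each row.
# Unlike unlist(), lapply() keeps the words of different rows separate.
words <- lapply(strsplit(df$text, " ", fixed = TRUE), unique)

# Rebuild the deduplicated strings and count the unique words per row.
df$dedup <- vapply(words, paste, character(1), collapse = " ")
df$n_unique <- lengths(words)

print(df)
```

Note that unlist() in the one-liner above merges all rows into one vector, so it only fits a single string; the per-row lapply() avoids that.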

Answer 2

Another possible solution, based on stringr::str_split:

library(tidyverse)

str_split(x, " ") %>% unlist %>% unique

#> [1] "A" "B" "C"
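
The pipeline above returns a character vector of words rather than a single string. If one string is needed, as in the question, the words can be joined again with str_c. This is a minimal sketch that assumes the stringr package is installed and that words are separated by single spaces; the name `result` is illustrative.

```r
library(stringr)

x <- c("A B B C")

# str_split() returns a list with one element per input string,
# so [[1]] picks the words of the first (and only) string.
words <- unique(str_split(x, " ")[[1]])

# Join the unique words back into a single space-separated string.
result <- str_c(words, collapse = " ")
print(result)
```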

Answer 3

You can use:

gsub("\\b(\\w+)(?:\\W+\\1\\b)+", "\\1", x)

Answer 4

In case the duplicates do not follow one another, here is another option that also uses gsub:

x <- c("A B B C")
gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", x, perl=TRUE)
#[1] "A B C"

gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", "A B B A ABBA", perl=TRUE)
#[1] "B A ABBA"
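
One thing worth noting is that the lookahead removes every word that appears again later, so the last occurrence of each word is the one that survives, while unique() keeps the first. A short comparison on the same input, as a sketch (the variable names are illustrative):

```r
s <- "A B B A ABBA"

# Lookahead version: a word is dropped whenever the same word occurs later.
keep_last <- gsub("\\b(\\S+)\\s+(?=.*\\b\\1\\b)", "", s, perl = TRUE)

# strsplit/unique version: the first occurrence of each word is kept.
keep_first <- paste(unique(strsplit(s, " ", fixed = TRUE)[[1]]), collapse = " ")

print(keep_last)   # "B A ABBA"
print(keep_first)  # "A B ABBA"
```

Both results contain the same set of words, so the difference only matters when word order is significant.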
