Removing duplicate words of a specific type within a column in an R data frame

I have a State column that looks like this:

            State
            Arizona, Arizona, Arizona, Arizona, 
            Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona
            Virginia, Virginia, Virginia
            .
            .
            .

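I want to remove all repeated words of a specific type, keeping a single instance of each. In this case I only want to remove the repeated Arizona and Virginia entries, so the final dataset should look like this: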

            Result
            Arizona
            Arizona, California Carmel Beach, California LBC, California Napa
            Virginia
            .
            .
            .

I think this is what you want.

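# strip leading and trailing whitespace (`state` is the character vector holding the column)
trimmed <- gsub('^\\s*', '', state)
trimmed <- gsub('\\s*$', '', trimmed)
# split on commas, keep unique entries, and rejoin each row into one string
# note: this drops every duplicate, not only selected names
lapply(lapply(strsplit(trimmed, '\\s*,\\s*'), unique), paste, collapse = ', ')

# Create a test data vector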
testin <- c(
"Arizona, Arizona, Arizona, Arizona, ", 
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona", 
"Virginia, Virginia, Virginia"
)

# The names to remove if duplicated
kickDuplicates <- c("Arizona", "Virginia")


# create a list of vectors of place names
broken <- strsplit(testin, ",\\s*")

# paste each broken vector of place names back together
# .......kicking out duplicated instances of the chosen names
testout <- sapply(broken, FUN = function(x)  paste(x[!duplicated(x) | !x %in% kickDuplicates ], collapse = ", "))

# see what we did 
testout
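
To apply the same idea directly to a data frame column, the logic above can be wrapped in a small function. This is only a sketch: the helper name `dedupe_names`, the data frame `df`, and the extra "Texas, Texas" row are assumptions added for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): drop repeated
# occurrences of the names in `targets`, keeping the first of each.
dedupe_names <- function(x, targets) {
  # trim the whole string, then split on commas with optional spaces
  parts <- strsplit(trimws(x), "\\s*,\\s*")
  vapply(parts, function(p) {
    p <- p[nzchar(p)]                           # drop empty pieces
    keep <- !duplicated(p) | !(p %in% targets)  # first occurrence, or not a target
    paste(p[keep], collapse = ", ")
  }, character(1))
}

# Illustrative data; the last row checks that non-target names are left alone
df <- data.frame(State = c(
  "Arizona, Arizona, Arizona, Arizona, ",
  "Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
  "Virginia, Virginia, Virginia",
  "Texas, Texas"
), stringsAsFactors = FALSE)

df$Result <- dedupe_names(df$State, c("Arizona", "Virginia"))
df$Result
```

Using `vapply` with `character(1)` returns a plain character vector, so the result can be assigned straight to a new column.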

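You could try a single gsub to get the unique values, but the order of the elements will differ, because the lookahead drops the earlier occurrences and keeps the last one.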

df1$Result <- gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*), ', "",
         df1$State, perl=TRUE)


df1$Result
#[1] "Arizona"                                                          
#[2] "California Carmel Beach, California LBC, California Napa, Arizona"
#[3] "Virginia"    

Data

df1 <- structure(list(State = c("Arizona, Arizona, Arizona, Arizona", 
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona", 
"Virginia, Virginia, Virginia")), .Names = "State", class = "data.frame", 
 row.names = c(NA, -3L))
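
The single-gsub pattern above removes any repeated one-word entry. If only specific names should be deduplicated, as the question asks, one possible variation is to build the alternation from a list of names. This is a sketch under assumptions: the vectors `states` and `targets` and the extra "Texas, Texas" row are illustrative, the names are assumed to contain no regex metacharacters, and a target is assumed never to appear as the tail of a longer entry (such as "West Virginia").

```r
# Illustrative input; the last row checks that non-target names are untouched
states <- c(
  "Arizona, Arizona, Arizona, Arizona",
  "Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
  "Virginia, Virginia, Virginia",
  "Texas, Texas"
)

# Assumed list of names whose repeats should be removed
targets <- c("Arizona", "Virginia")

# Match a target name followed by ", " only when the same name occurs again later
pattern <- paste0("\\b(", paste(targets, collapse = "|"), ")\\b(?=.*\\b\\1\\b), ")

result <- gsub(pattern, "", states, perl = TRUE)
result
```

As with the original pattern, the last occurrence of each target name is the one that survives, so the order of the remaining elements can change.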