Removing duplicate words of a specific type within a column in an R data frame
I have a State column that looks like this:
State
Arizona, Arizona, Arizona, Arizona,
Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona
Virginia, Virginia, Virginia
.
.
.
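I want to remove all duplicate words of a specific type, keeping one unique occurrence of each. In this case I only want to remove the duplicated Arizona and Virginia entries, so the final data set should look like this: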
Result
Arizona
Arizona, California Carmel Beach, California LBC, California Napa
Virginia
.
.
.
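I think this is what you want.
# strip leading and trailing whitespace
trimmed <- gsub('^\\s*', '', state)
trimmed <- gsub('\\s*$', '', trimmed)
# split on commas, keep unique entries, and join each row back into one string
lapply(lapply(strsplit(trimmed, '\\s*,\\s*'), unique), paste, collapse = ', ')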
# Create a test data vector
testin <- c(
"Arizona, Arizona, Arizona, Arizona, ",
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
"Virginia, Virginia, Virginia"
)
# The names to remove if duplicated
kickDuplicates <- c("Arizona", "Virginia")
# create a list of vectors of place names
broken <- strsplit(testin, ",\\s*")
# paste each broken vector of place names back together
# .......kicking out duplicated instances of the chosen names
testout <- sapply(broken, FUN = function(x) paste(x[!duplicated(x) | !x %in% kickDuplicates ], collapse = ", "))
# see what we did
testout
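The same idea can be wrapped in a small function and applied directly to a data frame column. This is only a sketch: the function name `drop_selected_duplicates`, the data frame `df`, and the extra "Texas" row are hypothetical, added for illustration, and entries are assumed to be comma-separated.

```r
# Hypothetical helper: keep the first occurrence of each name listed in
# `targets`; every other entry is left untouched, even if it repeats.
drop_selected_duplicates <- function(x, targets) {
  parts <- strsplit(trimws(x), "\\s*,\\s*")
  vapply(parts, function(p) {
    p <- p[nzchar(p)]                          # drop empty pieces
    keep <- !duplicated(p) | !(p %in% targets) # same test as in the answer above
    paste(p[keep], collapse = ", ")
  }, character(1))
}

# Illustrative data; the last row is an assumption, not from the question
df <- data.frame(
  State = c(
    "Arizona, Arizona, Arizona, Arizona, ",
    "Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
    "Virginia, Virginia, Virginia",
    "Texas, Texas, Arizona, Arizona"
  ),
  stringsAsFactors = FALSE
)

df$Result <- drop_selected_duplicates(df$State, c("Arizona", "Virginia"))
df$Result
```

In the last row, both "Texas" entries should survive because "Texas" is not in the list of names to deduplicate, while the second "Arizona" is dropped.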
You can try a single gsub to get the unique values, but the order of the elements will be different:
df1$Result <- gsub('(\\b\\S+\\b)(?=.*\\b\\1\\b.*), ', "",
df1$State, perl=TRUE)
df1$Result
#[1] "Arizona"
#[2] "California Carmel Beach, California LBC, California Napa, Arizona"
#[3] "Virginia"
Data
df1 <- structure(list(State = c("Arizona, Arizona, Arizona, Arizona",
"Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
"Virginia, Virginia, Virginia")), .Names = "State", class = "data.frame",
row.names = c(NA, -3L))
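The gsub pattern above matches any repeated single word. If only certain names should be deduplicated, one possible variation is to build the pattern from those names. This is a sketch under assumptions: the variable names `targets`, `pattern`, and `states` are hypothetical, the "Texas, Texas" row is added for illustration, and the names are assumed to contain no regex metacharacters. As with the original pattern, the last occurrence is the one kept, so the order can differ from the requested output.

```r
# Names whose earlier repeats should be removed (hypothetical variable)
targets <- c("Arizona", "Virginia")

# Match a target name followed by ", " only when the same name
# appears again later in the string (lookahead with a backreference)
pattern <- sprintf("\\b(%s)\\b(?=.*\\b\\1\\b), ", paste(targets, collapse = "|"))

states <- c(
  "Arizona, Arizona, Arizona, Arizona",
  "Arizona, Arizona, Arizona, California Carmel Beach, California LBC, California Napa, Arizona",
  "Virginia, Virginia, Virginia",
  "Texas, Texas"   # illustrative row: not a target, so it should stay as-is
)

result <- gsub(pattern, "", states, perl = TRUE)
result
```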