How to remove the items in a character vector that don't appear in another character vector?
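I have an uncleaned character vector, and I want to remove the items in it that don't belong to another character vector. So basically I know what I want to keep but not what I want to remove, which makes gsub() and str_replace_all hard to use.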
The vector I want to clean is issue_uncleaned, which looks like this (not the full version):
[1] "Facebook Fact-checks; Coronavirus; TikTok posts "
[2] "Facebook Fact-checks; Facebook posts "
[3] "Facebook Fact-checks; Coronavirus; Bloggers "
[4] "Facebook Fact-checks; Facebook posts "
[5] "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts "
The vector I want to use as a filter to remove the unwanted items is issues_151, which looks like this (not the full version):
[1] "Facebook Fact-checks" "Coronavirus" "Crime"
The result I want (it would be even better if there were also a way to remove the leading or trailing ;):
[1] "Facebook Fact-checks; Coronavirus; "
[2] "Facebook Fact-checks; "
[3] "Facebook Fact-checks; Coronavirus; "
[4] "Facebook Fact-checks; "
[5] "; ; Crime; Facebook Fact-checks; "
Thank you very much for your help!
Use strsplit to split each string on the separator, then intersect to keep only the wanted items, and paste to join them back together.
sapply(lapply(strsplit(v, '; '), intersect, issues), paste, collapse='; ')
# [1] "Facebook Fact-checks; Coronavirus" "Facebook Fact-checks"
# [3] "Facebook Fact-checks; Coronavirus" "Facebook Fact-checks"
# [5] "Facebook Fact-checks"
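To see what each step of the one-liner contributes, it can be unrolled as in the sketch below. The names x, keep, parts, kept and cleaned are made up for illustration, and trimws is added on the assumption that the real data carries trailing spaces like issue_uncleaned in the question.

```r
# Sample input with trailing spaces, as in the question (illustrative subset)
x <- c("Facebook Fact-checks; Coronavirus; TikTok posts ",
       "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts ")
keep <- c("Facebook Fact-checks", "Coronavirus", "Crime")

# 1. Split each string on ";" and strip whitespace around every piece
parts <- lapply(strsplit(x, ";", fixed = TRUE), trimws)

# 2. Keep only the pieces that also occur in `keep` (order of `parts` is preserved)
kept <- lapply(parts, intersect, keep)

# 3. Glue the surviving pieces back together; no leading or trailing ";" remains
cleaned <- vapply(kept, paste, character(1), collapse = "; ")
cleaned
# [1] "Facebook Fact-checks; Coronavirus" "Crime; Facebook Fact-checks"
```

Because the unwanted pieces are dropped before pasting, the stray separators from the desired output in the question never appear in the first place.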
Data:
v <- c("Facebook Fact-checks; Coronavirus; TikTok posts", "Facebook Fact-checks; Facebook posts",
"Facebook Fact-checks; Coronavirus; Bloggers", "Facebook Fact-checks; Facebook posts",
"National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts"
)
issues <- c("Facebook Fact-checks", "After the Fact", "Animals", "Bankruptcy",
"Border Security", "Ad Watch", "Agriculture", "Ask PolitiFact",
"Baseball", "Bush Administration", "Afghanistan", "Alcohol",
"Autism", "Bipartisanship", "Coronavirus")
issue_uncleaned <- c("Facebook Fact-checks; Coronavirus; TikTok posts ", "Facebook Fact-checks; Facebook posts ", "Facebook Fact-checks; Coronavirus; Bloggers ", "Facebook Fact-checks; Facebook posts ", "National; Criminal Justice; Crime; Facebook Fact-checks; Facebook posts ")
issues_151 <- c("Facebook Fact-checks", "Coronavirus", "Crime")
k <- strsplit(issue_uncleaned, "; ")
k <- lapply(k, trimws) # removes the white space at the end or beginning of each item
# keep only the items that appear in issues_151; lapply always returns a list,
# whereas sapply may simplify to a matrix when every element has the same length
k2 <- lapply(k, function(x) x[x %in% issues_151])
issue_cleaned <- sapply(k2, paste, collapse = "; ")
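One difference between the two answers may matter for real data: intersect returns each matching item only once, while subsetting with %in% keeps repeated items. The sketch below wraps the %in% approach in a small helper to show this; the name keep_listed and its sep argument are hypothetical and not taken from either answer.

```r
# keep_listed() is a hypothetical helper, not part of base R or any package
keep_listed <- function(x, allowed, sep = ";") {
  # split on the separator and trim whitespace around every piece
  parts <- lapply(strsplit(x, sep, fixed = TRUE), trimws)
  # keep pieces found in `allowed`, including repeats
  kept <- lapply(parts, function(p) p[p %in% allowed])
  # join the survivors; a string with no matches becomes ""
  vapply(kept, paste, character(1), collapse = paste0(sep, " "))
}

allowed <- c("Facebook Fact-checks", "Coronavirus", "Crime")
dup <- "Crime; Bloggers; Crime "

keep_listed(dup, allowed)
# [1] "Crime; Crime"

# intersect() drops the repeated item
paste(intersect(trimws(strsplit(dup, ";", fixed = TRUE)[[1]]), allowed), collapse = "; ")
# [1] "Crime"
```

Which behaviour is preferable depends on whether repeated issues in one entry are meaningful or just noise.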