How to remove words that are not capitalized in R?
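I'm doing text analysis in R. Is there a way to use tm or stringi to remove all words that are not capitalized?
If I have something like this
Albert Einstein went to the store and saw his friend Nikola Tesla ... + 200 pages
and want to turn it into
Albert Einstein Nikola Tesla
Regards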
Just use grep and a regular expression:
words <- 'Albert Einstein went to the store and saw his friend Nikola Tesla'
# split to vector of individual words
vec <- unlist(strsplit(words, ' '))
# just the capitalized ones
caps <- grep('^[A-Z]', vec, value = TRUE)
# assemble back to a single string, if you want
paste(caps, collapse=' ')
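The same split-and-filter idea can be wrapped in a small helper when there is more than one document. This is only a sketch in base R: the name keep_capitalized is made up for illustration, and it splits on any run of whitespace instead of a single space, which is an assumption about how the real text is laid out.

```r
# Hypothetical helper (not part of tm or stringi): for each element of a
# character vector, keep only the words that start with an ASCII capital.
keep_capitalized <- function(docs) {
  vapply(
    strsplit(docs, "[[:space:]]+"),   # one vector of words per document
    function(vec) {
      caps <- grep("^[A-Z]", vec, value = TRUE)
      paste(caps, collapse = " ")     # back to a single string
    },
    character(1)
  )
}

docs <- c(
  "Albert Einstein went to the store and saw his friend Nikola Tesla",
  "the quick brown Fox"
)
result <- keep_capitalized(docs)
result
```

Note that this keeps any word starting with a capital, so the first word of every sentence survives too, whether or not it is a name.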
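You can remove these words with a simple regular expression
x <- "Albert Einstein went to the store and saw his friend Nikola Tesla"
gsub("\\b[a-z]+\\s+", "", x)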
# [1] "Albert Einstein Nikola Tesla"
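This looks for a word boundary, followed by one or more lowercase letters, followed by the whitespace after them, and removes the whole match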
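To see what the pattern actually picks up, it can help to list the matches before deleting them. A rough sketch, assuming the example sentence from the question as input; perl = TRUE is my addition so that \b and \s follow PCRE rules, and the variable names are just for illustration.

```r
x <- "Albert Einstein went to the store and saw his friend Nikola Tesla"

# In R source code every regex backslash has to be written twice
pattern <- "\\b[a-z]+\\s+"

# Every piece of text the pattern matches; these are what gsub deletes
removed <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
removed

# Deleting those matches leaves only the capitalized words
kept <- gsub(pattern, "", x, perl = TRUE)
kept
```

A capitalized word such as Albert is never touched, because there is no word boundary between its first letter and the lowercase letters that follow.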
In some cases, though, you will have words such as don't, and then you need a slightly more complex regex. Something like
x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"
gsub("\\b[a-z][^ ]*(\\s+)?", "", x)
# [1] "Albert Einstein Nikola Tesla "
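The result above ends with a stray space, because the last removed word had a space in front of it that belongs to no match. If that matters, one option is to trim afterwards. A small sketch, assuming base R 3.2.0 or later for trimws; perl = TRUE is again my addition, not part of the call above.

```r
x <- "if Albert Einstein didn't see his friend Nikola Tesla leavin'"

# Same pattern as above: a lowercase letter at a word boundary, then
# everything up to the next space, then optional whitespace
res <- gsub("\\b[a-z][^ ]*(\\s+)?", "", x, perl = TRUE)

# Drop leading and trailing whitespace left behind by the removals
clean <- trimws(res)
clean
```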