How to remove stopwords efficiently from a list of ngram tokens in R
Here is an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words", so that the occurrence of any stop word term in an n-gram triggers removal.
I would very much like to have one solution that works for both unigrams and n-grams, although it would be fine to have two versions, one with a "fixed" flag and one with a "regex" flag. I am putting the two aspects of the question together because someone may have a solution that takes a different approach and handles both fixed and regular-expression stopword patterns.
Formats:
tokens is a list of character vectors, which may be unigrams, or n-grams concatenated by an _ (underscore) character.
stopwords is a character vector. Right now I am content to let these be fixed strings, but it would be a nice bonus to be able to implement this with regular-expression-formatted stopwords too.
Desired output: A list of characters matching the input tokens, but with any token whose component matches a stopword removed. (This means a unigram match, or a match to one of the terms that the n-gram comprises.)
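For concreteness, the kind of interface being asked for might look roughly like the sketch below (my illustration, not code from the question; removeStopwordTokens and the valuetype argument are hypothetical names):
# Hypothetical sketch: one function for unigrams and n-grams, switching
# between fixed and regular-expression stopword matching.
removeStopwordTokens <- function(tokens, stopwords,
                                 valuetype = c("fixed", "regex")) {
  valuetype <- match.arg(valuetype)
  lapply(tokens, function(x) {
    parts <- strsplit(x, "_", fixed = TRUE)   # component words of each token
    drop <- vapply(parts, function(p) {
      if (valuetype == "fixed")
        any(p %in% stopwords)                 # exact match on a component word
      else
        any(vapply(stopwords, function(s) any(grepl(s, p)), logical(1)))
    }, logical(1))
    x[!drop]
  })
}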
Examples, test data, working code, and benchmarks to build on:
tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"),
text2 = c("some", "more", "words", "in", "this", "test", "text"))
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with", "text_with_a", "with_a_few", "a_few_words"),
text2 = c("some_more_words", "more_words_in", "words_in_this", "in_this_text", "this_text_text"))
stopwords <- c("is", "a", "in", "this")
# remove any single token that matches a stopword
removeTokensOP1 <- function(w, stopwords) {
lapply(w, function(x) x[-which(x %in% stopwords)])
}
# remove any word pair where a single word contains a stopword
removeTokensOP2 <- function(w, stopwords) {
matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
lapply(w, function(x) x[-grep(matchPattern, x)])
}
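A caveat worth flagging on both versions (my note, not from the question): the negative indexing x[-which(...)] and x[-grep(...)] returns an empty vector whenever there is no match, because subscripting with a zero-length index drops every element. The sample texts below all contain at least one stopword, so it does not show up here; a logical-index variant avoids the edge case:
# safer variant of removeTokensOP1 (illustrative sketch): logical indexing
# keeps all tokens when no stopword is present, instead of dropping them all
removeTokensOP1b <- function(w, stopwords) {
  lapply(w, function(x) x[!(x %in% stopwords)])
}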
removeTokensOP1(tokens1, stopwords)
## $text1
## [1] "test" "text" "with" "few" "words"
##
## $text2
## [1] "some" "more" "words" "test" "text"
removeTokensOP2(tokens1, stopwords)
## $text1
## [1] "test" "text" "with" "few" "words"
##
## $text2
## [1] "some" "more" "words" "test" "text"
removeTokensOP2(tokens2, stopwords)
## $text1
## [1] "test_text" "text_with" "few_words"
##
## $text2
## [1] "some_more" "more_words" "text_text"
removeTokensOP2(tokens3, stopwords)
## $text1
## [1] "test_text_with"
##
## $text2
## [1] "some_more_words"
# performance benchmarks for answers to build on
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
OP2_1 = removeTokensOP2(tokens1, stopwords),
OP2_2 = removeTokensOP2(tokens2, stopwords),
OP2_3 = removeTokensOP2(tokens3, stopwords),
unit = "relative")
## Unit: relative
## expr min lq mean median uq max neval
## OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
## OP2_1 5.119066 3.812845 3.438076 3.714492 3.547187 2.838351 100
## OP2_2 5.230429 3.903135 3.509935 3.790143 3.631305 2.510629 100
## OP2_3 5.204924 3.884746 3.578178 3.753979 3.553729 8.240244 100
We can improve on the lapply if you have many levels in your list, by using the parallel package.
Create many levels:
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens2 <- lapply(1:500,function(x) sample(tokens2,1)[[1]])
We do this because the parallel package has a lot of setup overhead, so simply increasing the number of iterations in microbenchmark would keep incurring that cost. By increasing the size of the list, you see the true improvement.
library(parallel)
#Setup
cl <- detectCores()
cl <- makeCluster(cl)
#Two functions:
#original
removeTokensOP2 <- function(w, stopwords) {
matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
lapply(w, function(x) x[-grep(matchPattern, x)])
}
#new
removeTokensOPP <- function(w, stopwords) {
matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
return(w[-grep(matchPattern, w)])
}
#compare
microbenchmark(
OP2_P = parLapply(cl,tokens2,removeTokensOPP,stopwords),
OP2_2 = removeTokensOP2(tokens2, stopwords),
unit = 'relative'
)
Unit: relative
expr min lq mean median uq max neval
OP2_P 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
OP2_2 1.730565 1.653872 1.678781 1.562258 1.471347 10.11306 100
Performance will improve as the number of levels in the list increases.
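One housekeeping note (mine, not part of the answer): shut the cluster down once the benchmark is finished, otherwise the worker processes started by makeCluster() stay alive.
# release the workers started above
stopCluster(cl)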
You might consider simplifying the regular expression; the repeated ^ and $ anchors add overhead:
remove_short <- function(x, stopwords) {
stopwords_regexp <- paste0('(^|_)(', paste(stopwords, collapse = '|'), ')(_|$)')
lapply(x, function(x) x[!grepl(stopwords_regexp, x)])
}
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
OP2_1 = removeTokensOP2(tokens2, stopwords),
OP2_2 = remove_short(tokens2, stopwords),
unit = "relative")
Unit: relative
expr min lq mean median uq max neval cld
OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
OP2_1 5.178565 4.768749 4.465138 4.441130 4.262399 4.266905 100 c
OP2_2 3.452386 3.247279 3.063660 3.068571 2.963794 2.948189 100 b
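To see what the two approaches actually hand to the regex engine, these are the patterns the two paste0() calls build for the example stopwords (shown for illustration; the strings follow directly from the code above):
stopwords <- c("is", "a", "in", "this")
# original pattern: a separate (^|_)...(_|$) anchor pair per stopword
paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
## [1] "(^|_)is(_|$)|(^|_)a(_|$)|(^|_)in(_|$)|(^|_)this(_|$)"
# simplified pattern: one anchor pair around a single alternation group
paste0('(^|_)(', paste(stopwords, collapse = '|'), ')(_|$)')
## [1] "(^|_)(is|a|in|this)(_|$)"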
This isn't really an answer; it is more of a comment in reply to rawr's comment about going through all combinations of stopwords. With a longer stopwords list, using something like %in% does not seem to suffer from that dimensionality issue.
library(purrr)
removetokenstst <- function(tokens, stopwords)
  map2(tokens,
       lapply(tokens, function(x) {  # flag n-grams whose component words include a stopword
         unlist(lapply(strsplit(x, "_"), function(y) {
           any(y %in% stopwords)
         }))
       }),
       ~ .x[!.y])
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, morestopwords),
OP2_1 = removeTokensOP2(tokens1, morestopwords),
OP2_2 = removeTokensOP2(tokens2, morestopwords),
OP2_3 = removeTokensOP2(tokens3, morestopwords),
Ak_3 = removetokenstst(tokens3, stopwords),
Ak_3msw = removetokenstst(tokens3, morestopwords),
unit = "relative")
Unit: relative
expr min lq mean median uq max neval
OP1_1 1.00000 1.00000 1.000000 1.000000 1.000000 1.00000 100
OP2_1 278.48260 176.22273 96.462854 79.787932 76.904987 38.31767 100
OP2_2 280.90242 181.22013 98.545148 81.407928 77.637006 64.94842 100
OP2_3 279.43728 183.11366 114.879904 81.404236 82.614739 72.04741 100
Ak_3 15.74301 14.83731 9.340444 7.902213 8.164234 11.27133 100
Ak_3msw 18.57697 14.45574 12.936594 8.513725 8.997922 24.03969 100
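For comparison, here is a base-R sketch of the same %in% idea without purrr (my version, not from the answer): split each n-gram on "_" and keep it only when none of its component words is a stopword.
# base-R equivalent of the map2() approach above (illustrative sketch)
removetokens_base <- function(tokens, stopwords) {
  lapply(tokens, function(x) {
    keep <- !vapply(strsplit(x, "_", fixed = TRUE),
                    function(parts) any(parts %in% stopwords),
                    logical(1))
    x[keep]
  })
}
removetokens_base(tokens3, stopwords)  # same result as removeTokensOP2(tokens3, stopwords) above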
Stopwords:
morestopwords = c("a", "about", "above", "after", "again", "against", "all",
"am", "an", "and", "any", "are", "arent", "as", "at", "be", "because",
"been", "before", "being", "below", "between", "both", "but",
"by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do",
"does", "doesnt", "doing", "dont", "down", "during", "each",
"few", "for", "from", "further", "had", "hadnt", "has", "hasnt",
"have", "havent", "having", "he", "hed", "hell", "hes", "her",
"here", "heres", "hers", "herself", "him", "himself", "his",
"how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into",
"is", "isnt", "it", "its", "its", "itself", "lets", "me", "more",
"most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off",
"on", "once", "only", "or", "other", "ought", "our", "ours",
"ourselves", "out", "over", "own", "same", "shant", "she", "shed",
"shell", "shes", "should", "shouldnt", "so", "some", "such",
"than", "that", "thats", "the", "their", "theirs", "them", "themselves",
"then", "there", "theres", "these", "they", "theyd", "theyll",
"theyre", "theyve", "this", "those", "through", "to", "too",
"under", "until", "up", "very", "was", "wasnt", "we", "wed",
"well", "were", "weve", "were", "werent", "what", "whats", "when",
"whens", "where", "wheres", "which", "while", "who", "whos",
"whom", "why", "whys", "with", "wont", "would", "wouldnt", "you",
"youd", "youll", "youre", "youve", "your", "yours", "yourself",
"yourselves", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j",
"k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w",
"x", "y", "z")