Custom stopword list removal

I am trying to remove phrases from text using a custom word list.

Here is a reproducible example.

I think something in my attempt is wrong:

mystop <-  structure(list(stopwords = c("remove", "this line", "remove this line", 
"two lines")), .Names = "stopwords", class = "data.frame", row.names = c(NA, 
-4L))
df <-  structure(list(stopwords = c("Something to remove", "this line must remove two tokens", 
"remove this line must remove three tokens", "two lines to", 
"nothing here to stop")), .Names = "stopwords", class = "data.frame", row.names = c(NA, 
-5L))
> mycorpus <- corpus(df$stopwords)
> mydfm <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE), c(stopwords("SMART"), mystop$stopwords)), ngrams = c(1,3))
> 
> 
> #convert the dfm to dataframe
> df_ngram <- data.frame(Content = featnames(mydfm), Frequency = colSums(mydfm), 
+                  row.names = NULL, stringsAsFactors = FALSE)
> 
> df_ngram
  Content Frequency
1    line         2
2  tokens         2
3   lines         1
4    stop         1
> df
                                  stopwords
1                       Something to remove
2          this line must remove two tokens
3 remove this line must remove three tokens
4                              two lines to
5                      nothing here to stop

From the example, shouldn't I expect to find features like `Something to` in the dfm? That is, shouldn't I be able to see each document clearly, with only the stopwords removed?

I want to remove the stopword features from the ngram tokens, so I tried this:

mydfm2 <- dfm(tokens_remove(tokens(df$stopwords, remove_punct = TRUE, ngrams = 1:3), remove = c(stopwords("english"), mystop$stopwords)))
Error in tokens_select(x, ..., selection = "remove") : 
  unused argument (remove = c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "would", "should", "could", "ought", "i'm", "you're", 
"he's", "she's", "it's", "we're", "they're", "i've", "you've", "we've", "they've", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "isn't", "aren't", "wasn't", "weren't", "hasn't", "haven't", "hadn't", "doesn't", "don't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "can't", "cannot", "couldn't", "mustn't", "let's", "that's", "who's", "what's", "here's", "there's", "when's", "where's", "why's", "how's",

Edit, with another reproducible example: this is some dummy text I found in another question:

df <- structure(list(text = c("video game consoles stereos smartphone chargers and other similar devices constantly draw power into their power supplies. Unplug all of your chargers whether it's for a tablet or a toothbrush. Electronics with standby or \\"\\"sleep\\"\\" modes: Desktop PCs televisions cable boxes DVD-ray players alarm clocks radios and anything with a remote", 
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions the impugned order is in the teeth of the recommendations of the said Committee as communicated in its letter dated 14.05.2017", 
"... focus to the ayurveda sector especially in oral care. A year ago Colgate launched its first India-focused ayurvedic brand Cibaca Vedshakti aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products including toothpaste under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian", 
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali  even though both of these have enough local and multinational competitors in the organised", 
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees it has not been able to hear cases of human rights violations in Maharashtra. A division"
)), .Names = "text", class = "data.frame", row.names = c(NA, 
-5L))

The stopwords (I created this list using quanteda's ngrams):

mystop <- structure(list(stop = c("dated_modern_dental", "hiring", "local", 
"employees", "modern_dental_college", "multinational", "competitors", 
"state", "dental_college_research", "organised", "human", "rights", 
"college_research_centre", "commission", "founder_increate_advisors", 
"research_centre_supra", "sector_oral_care", "left", "toothless", 
"centre_supra_authorizing")), .Names = "stop", class = "data.frame", row.names = c(NA, 
-20L))

All the steps in code:

library(quanteda)
library(stringr)
# text to lower case
df$text <- tolower(df$text)
# remove all punctuation
df$text <- gsub("[[:punct:]]", " ", df$text)
# remove numbers
df$text <- gsub("[0-9]+", "", df$text)
# remove remaining non-alphanumeric characters (e.g. Chinese characters)
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
# collapse runs of whitespace (note the escaped backslash in "\\s+")
df$text <- gsub("\\s+", " ", str_trim(df$text))
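The cleaning steps above can be collapsed into a single base-R helper (note that the whitespace pattern must be escaped as `\\s+` in an R string; a single backslash is an invalid escape):

```r
# condense the cleaning pipeline into one reusable function (base R only)
clean_text <- function(x) {
  x <- tolower(x)                    # lower-case
  x <- gsub("[[:punct:]]", " ", x)   # punctuation to spaces
  x <- gsub("[0-9]+", "", x)         # drop digits
  x <- gsub("[^[:alnum:]]", " ", x)  # remaining non-alphanumerics to spaces
  gsub("\\s+", " ", trimws(x))       # trim and collapse whitespace
}

clean_text("Hello,   World!! 2024  ")
# [1] "hello world"
```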

This is the step where I create the ngrams, while removing English stopwords combined with my own stopword list from the input text.

myDfm <- dfm(tokens_remove(tokens(df$text, remove_punct = TRUE),  c(stopwords("SMART"), mystop$stop)), ngrams = c(1,3))

However, if I convert myDfm to a data frame to check whether the stopword removal worked, I can still see them:

df_ngram <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm), 
                 row.names = NULL, stringsAsFactors = FALSE)
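As a side note, the same kind of frequency table can be obtained more directly with quanteda's `topfeatures()`. A sketch on toy text (not the data above):

```r
library(quanteda)

# build a small dfm from two hypothetical documents
toyDfm <- dfm(tokens(c("one two three", "two three four")))

# named vector of feature counts, most frequent first
freqs <- topfeatures(toyDfm)

data.frame(Content = names(freqs), Frequency = unname(freqs),
           stringsAsFactors = FALSE)
```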

I will try to provide what I think is the answer you want, although your question is hard to follow because the actual problem is buried in a series of mostly unnecessary steps that are not directly related to it.

I think you are confused about how to remove stopwords (in this case, some that you have supplied) and then form ngrams.

Here is how to create the corpus and the character vector of stopwords. No need for lists, etc. Note that this is for quanteda v1.0.0, which now uses the stopwords package for its stopword lists.

mycorpus <- corpus(df$stopwords)
mystopwords <- c(stopwords(source = "smart"), mystop$stopwords)

Now we can construct the tokens by hand, removing the stopwords but leaving a "pad" in their place, to prevent ngrams being formed from words that were never actually adjacent.

mytoks <- 
    tokens(mycorpus) %>%
    tokens_remove(mystopwords, padding = TRUE)
mytoks
# tokens from 5 documents.
# text1 :
# [1] "" "" ""
# 
# text2 :
# [1] ""       "line"   ""       ""       ""       "tokens"
# 
# text3 :
# [1] ""       ""       "line"   ""       ""       ""       "tokens"
# 
# text4 :
# [1] ""      "lines" ""     
# 
# text5 :
# [1] ""     ""     ""     "stop"

At this stage, we can apply the ngrams, either with `tokens_ngrams()` or with the `ngrams` option in `dfm()`. Let's use the latter.

dfm(mytoks, ngrams = c(1,3))
# Document-feature matrix of: 5 documents, 4 features (70% sparse).
# 5 x 4 sparse Matrix of class "dfm"
#        features
# docs    line tokens lines stop
#   text1    0      0     0    0
#   text2    1      1     0    0
#   text3    1      1     0    0
#   text4    0      0     1    0
#   text5    0      0     0    1

No ngrams were created because, as you can see from the printed tokens above, no tokens remained adjacent to any other tokens once the stopwords from the mystopwords vector were removed.
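To see the padding doing its job in a case where ngrams do survive, here is a small sketch on a made-up sentence: the bigram forms only between tokens that were genuinely adjacent, and nothing spans the pad left by the removed word.

```r
library(quanteda)

toks <- tokens("red apples taste very good") %>%
    tokens_remove("very", padding = TRUE)

tokens_ngrams(toks, n = 2)
# bigrams form only between truly adjacent tokens
# ("red_apples", "apples_taste"); nothing spans the pad
# left where "very" was removed, so no "taste_good"
```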