Removing stopwords from a big dataframe in R using loops takes too much time
I am trying to remove stopwords from a big data frame (12 million rows) in R. I tried running it on a 30k-row data frame and it worked well (it finished in about 2 minutes). On a 300k-row data frame it takes far too long (around 4 hours), but I need to run it on the 12-million-row data frame, so I would like to know whether there is another way to do this (the loop is probably what makes it slow).
The trait_text function is defined in the code below.
removeWords is a function from the tm package that removes stopwords from a character vector.
Another question in the same context:
Do I need to migrate to 64-bit RStudio? The 32-bit version does not use all of the RAM available on the machine.
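A side note on the second question: what matters is the R build itself rather than the RStudio front end, since a 32-bit R process is limited to a few GB of address space regardless of installed RAM. A minimal check of which build is running:

#check whether the current R session is a 64-bit build
.Machine$sizeof.pointer  #8 means 64-bit, 4 means 32-bit
R.version$arch           #e.g. "x86_64" on a 64-bit build

If this reports 32-bit, switching to a 64-bit R build will let the session use all available memory, though it will not by itself make the loop faster.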
#define stopwords
stop<-c("MONSIEUR","MADAME","MR","MME","M","SARL","SA","EARL","EURL","SCI","SAS","ETS","STE","SARLU", "SASU","CFA","ATS","GAEC","COMMUNE","SOCIETE",toupper(stopwords::stopwords("fr", source = "snowball")))
##trait_text helpers:
#Remove multiple spaces
del_multispace = function(text) {
  return(gsub("\\s+", " ", text))
}
#Remove punctuation
del_punctuation = function(text) {
  return(gsub("[[:punct:]]", "", text))
}
#Remove accents
del_accent = function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  text <- gsub("['`^~\"]", "", text)
  return(text)
}
trait_text = function(text) {
  text = del_multispace(text)
  text = del_punctuation(text)
  text = del_accent(text)
  return(text)
}
#remove stopwords for data (row-by-row loop):
system.time(for (i in 1:nrow(test_data)) {
  print(paste("client n:", i))
  x <- removeWords(trait_text(test_data$ref[i]), stop)
  #output
  test_data$ref[i] <- gdata::trim(paste(x, collapse = ' '))
})
Sample test_data with desired output:

    ref           output
1   "LE LA ONE"   "ONE"
2   "SAS TWO"     "TWO"
3   "MR THREE"    "THREE"
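For a quick reproduction, the sample can be built as a small data frame (this reconstruction is only an illustration; just the ref column and its values come from the question):

#hypothetical reconstruction of the sample data
test_data <- data.frame(ref = c("LE LA ONE", "SAS TWO", "MR THREE"),
                        stringsAsFactors = FALSE)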
I came up with a solution to my problem that avoids the loop entirely.
The code is below:
library(tm)
library(gdata)
#stopwords
stop<-c("MONSIEUR","MADAME","MR","MME","M","SARL","SA","EARL","EURL","SCI","SAS","ETS","STE","SARLU","SASU","CFA","ATS","GAEC","COMMUNE","SOCIETE",toupper(stopwords::stopwords("fr", source = "snowball")))
#Remove multiple spaces
del_multispace = function(text) {
  return(gsub("\\s+", " ", text))
}
#Remove punctuation
del_punctuation = function(text) {
  return(gsub("[[:punct:]]", "", text))
}
#Remove accents
del_accent = function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  text <- gsub("['`^~\"]", "", text)
  return(text)
}
#remove stopwords from text
del_stopwords = function(text) {
  return(removeWords(text, stop))
}
#Cleaning function: every step is vectorized, so it runs over the
#whole column at once instead of one R iteration per row
trait_text = function(text) {
  text = del_multispace(text)
  text = del_punctuation(text)
  text = del_accent(text)
  text = del_stopwords(text)
  return(text)
}
#remove stopwords from test_data (no loop):
system.time(test_data$x <- trim(trait_text(test_data$ref)))
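Applied to the three-row sample sketched above, this produces the desired output (expected result, assuming the hypothetical test_data reconstruction):

test_data$x
#[1] "ONE"   "TWO"   "THREE"

The speedup comes from vectorization: gsub, iconv and removeWords each process the whole character vector in compiled code, so the column is cleaned in a handful of regex passes instead of 12 million R-level loop iterations.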