Removing stopwords from a big dataframe in R using loops takes too much time
I am trying to remove stopwords from a big data frame (12 million rows) in R. I tried running it on a 30k-row data frame and it worked well (it finished in about 2 minutes). On a 300k-row data frame it takes far too long (around 4 hours), but I need to run it on the 12-million-row data frame, so I would like to know whether there is another way to do this (the loop is probably what makes it slow).
The trait_text function is defined in the code below.
removeWords is a function from the tm package that removes stopwords from a character vector.
Another question in the same context:
Do I need to migrate to 64-bit RStudio? The 32-bit version does not use all of the RAM available on the machine.
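A side note on the second question: what matters is the R build itself rather than the RStudio front end, since a 32-bit R process is limited to a few GB of address space regardless of installed RAM. A minimal check of which build is running:

#check whether the current R session is a 64-bit build
.Machine$sizeof.pointer  #8 means 64-bit, 4 means 32-bit
R.version$arch           #e.g. "x86_64" on a 64-bit build

If this reports 32-bit, switching to a 64-bit R build will let the session use all available memory, though it will not by itself make the loop faster.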
#define stopwords
stop<-c("MONSIEUR","MADAME","MR","MME","M","SARL","SA","EARL","EURL","SCI","SAS","ETS","STE","SARLU", "SASU","CFA","ATS","GAEC","COMMUNE","SOCIETE",toupper(stopwords::stopwords("fr", source = "snowball")))
##trait_text helpers:
#Remove multiple spaces
del_multispace = function(text) {
  return(gsub("\\s+", " ", text))
}
#Remove punctuation
del_punctuation = function(text) {
  return(gsub("[[:punct:]]", "", text))
}
#Remove accents
del_accent = function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  text <- gsub("['`^~\"]", "", text)
  return(text)
}
trait_text = function(text) {
  text = del_multispace(text)
  text = del_punctuation(text)
  text = del_accent(text)
  return(text)
}
#remove stopwords for data (row-by-row loop):
system.time(for (i in 1:nrow(test_data)) {
  print(paste("client n:", i))
  x <- removeWords(trait_text(test_data$ref[i]), stop)
  #output
  test_data$ref[i] <- gdata::trim(paste(x, collapse = ' '))
})
Sample test_data with desired output:

    ref           output
1   "LE LA ONE"   "ONE"
2   "SAS TWO"     "TWO"
3   "MR THREE"    "THREE"
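For a quick reproduction, the sample can be built as a small data frame (this reconstruction is only an illustration; just the ref column and its values come from the question):

#hypothetical reconstruction of the sample data
test_data <- data.frame(ref = c("LE LA ONE", "SAS TWO", "MR THREE"),
                        stringsAsFactors = FALSE)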
I came up with a solution to my problem that avoids the loop entirely.
The code is below:
library(tm)
library(gdata)
#stopwords
stop<-c("MONSIEUR","MADAME","MR","MME","M","SARL","SA","EARL","EURL","SCI","SAS","ETS","STE","SARLU","SASU","CFA","ATS","GAEC","COMMUNE","SOCIETE",toupper(stopwords::stopwords("fr", source = "snowball")))
#Remove multiple spaces
del_multispace = function(text) {
  return(gsub("\\s+", " ", text))
}
#Remove punctuation
del_punctuation = function(text) {
  return(gsub("[[:punct:]]", "", text))
}
#Remove accents
del_accent = function(text) {
  text <- gsub("['`^~\"]", " ", text)
  text <- iconv(text, from = "UTF-8", to = "ASCII//TRANSLIT//IGNORE")
  text <- gsub("['`^~\"]", "", text)
  return(text)
}
#remove stopwords from text
del_stopwords = function(text) {
  return(removeWords(text, stop))
}
#Cleaning function: every step is vectorized, so it runs over the
#whole column at once instead of one R iteration per row
trait_text = function(text) {
  text = del_multispace(text)
  text = del_punctuation(text)
  text = del_accent(text)
  text = del_stopwords(text)
  return(text)
}
#remove stopwords from test_data (no loop):
system.time(test_data$x <- trim(trait_text(test_data$ref)))
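Applied to the three-row sample sketched above, this produces the desired output (expected result, assuming the hypothetical test_data reconstruction):

test_data$x
#[1] "ONE"   "TWO"   "THREE"

The speedup comes from vectorization: gsub, iconv and removeWords each process the whole character vector in compiled code, so the column is cleaned in a handful of regex passes instead of 12 million R-level loop iterations.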