Performance issue while trying to match a list of words with a list of sentences in R
I am trying to match a list of words against a list of sentences and build a data frame from the matched words and sentences. For example:
words <- c("far better","good","great","sombre","happy")
sentences <- c("This document is far better","This is a great app","The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")
The expected result (a data frame) is as follows:
sentences                                     words
This document is far better                   better
This is a great app                           great
The night skies were sombre and starless      sombre
The app is too good and i am happy using it   good, happy
This is how it works                          -
I am using the following code to achieve this.
lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y>0]$x
neg.words <- polarity_table[polarity_table$y<0]$x
positiveWordsList <- list()
negativeWordsList <- list()
for(i in 1:lengthOfData){
sentence <- sentence_df[i,]$comment
#sentence <- gsub('[[:punct:]]', "", sentence)
#sentence <- gsub('[[:cntrl:]]', "", sentence)
#sentence <- gsub('\\d+', "", sentence)
sentence <- tolower(sentence)
# get unigrams from the sentence
unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))
# get bigrams from the sentence
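# (seq_len() avoids the precedence trap in 1:length(unigrams)-1, which evaluates to 0:(n-1))
bigrams <- unlist(lapply(seq_len(max(length(unigrams) - 1, 0)), function(i) {paste(unigrams[i],unigrams[i+1])} ))
# .. and combine into a single character vector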
words <- c(unigrams, bigrams)
#if(sentence_df[i,]$ave_sentiment)
pos.matches <- match(words, pos.words)
neg.matches <- match(words, neg.words)
pos.matches <- na.omit(pos.matches)
neg.matches <- na.omit(neg.matches)
positiveList <- pos.words[pos.matches]
negativeList <- neg.words[neg.matches]
if(length(positiveList)==0){
positiveList <- c("-")
}
if(length(negativeList)==0){
negativeList <- c("-")
}
negativeWordsList[i]<- paste(as.character(unique(negativeList)), collapse=", ")
positiveWordsList[i]<- paste(as.character(unique(positiveList)), collapse=", ")
positiveWordsList[i] <- sapply(positiveWordsList[i], function(x) toString(x))
negativeWordsList[i] <- sapply(negativeWordsList[i], function(x) toString(x))
}
positiveWordsList <- as.vector(unlist(positiveWordsList))
negativeWordsList <- as.vector(unlist(negativeWordsList))
scores.df <- data.frame(ave_sentiment=sentence_df$ave_sentiment, comment=sentence_df$comment,pos=positiveWordsList,neg=negativeWordsList, year=sentence_df$year,month=sentence_df$month,stringsAsFactors = FALSE)
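I have 28k sentences and 65k words to match against. The above code takes 45 seconds to complete the task. Any suggestions on how to improve its performance? The current approach takes too long.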
Edit:
I only want the words that exactly match a whole word in the sentence. For example:
words <- c('sin','vice','crashes')
sentences <- ('Since the app crashes frequently, I advice you guys to fix the issue ASAP')
For the above case, my output should be as follows:
sentences                                                                   words
Since the app crashes frequently, I advice you guys to fix the issue ASAP   crashes
I was able to use @David Arenburg's answer with some modifications. Here is what I did. I used the following (suggested by David) to build the data frame.
library(stringi)  # provides stri_detect_fixed()
df <- data.frame(sentences)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))
The problem with the above approach is that it does not do exact word matching.
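To make the difference concrete, here is a minimal sketch comparing substring matching with whole-word matching. It is an illustration only: base R `grepl(..., fixed = TRUE)` stands in for `stri_detect_fixed` so the example has no package dependency, and the sentence is lowercased and stripped of punctuation first, which is an assumption about the preprocessing.

```r
# Assumption: base R grepl(fixed = TRUE) stands in for stringi::stri_detect_fixed().
words <- c("sin", "vice", "crashes")
sentence <- tolower("Since the app crashes frequently, I advice you guys to fix the issue ASAP")

# Substring matching: a word counts as found if it occurs anywhere in the text.
substring_hit <- vapply(words, function(w) grepl(w, sentence, fixed = TRUE), logical(1))

# Whole-word matching: the word must equal one of the sentence's tokens.
tokens <- unlist(strsplit(gsub("[[:punct:]]", "", sentence), "\\s+"))
whole_word_hit <- words %in% tokens

print(data.frame(word = words, substring_hit, whole_word_hit, row.names = NULL))
```

Here "sin" and "vice" are found as substrings of "since" and "advice", but only "crashes" is a token of the sentence.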
So I used the following to filter out the words that do not exactly match a word in the sentence.
df <- data.frame(fil=unlist(s),text=rep(df$sentence, sapply(s, FUN=length)))
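The post does not show how `s` is built. One plausible reading, sketched below on toy data, is that `s` is the comma-separated `words` column split back into a list, so that each sentence is repeated once per matched word. The definition of `s` and the name `long_df` are assumptions for illustration.

```r
# Toy stand-in for the data frame built above (same column names).
df <- data.frame(
  sentences = c("The app is too good and i am happy using it", "This is a great app"),
  words     = c("good, happy", "great"),
  stringsAsFactors = FALSE
)

# Assumption: s holds the matched words of each sentence as separate elements.
s <- strsplit(df$words, ", ", fixed = TRUE)

# Repeat each sentence once per matched word, so every row carries one word.
long_df <- data.frame(
  fil  = unlist(s),
  text = rep(df$sentences, lengths(s)),
  stringsAsFactors = FALSE
)
print(long_df)
```

Sentences without any matched word would need separate handling under this reading.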
After applying the above line, the output data frame changes as follows.
sentences                                                                   words
This document is far better                                                 better
This is a great app                                                         great
The night skies were sombre and starless                                    sombre
The app is too good and i am happy using it                                 good
The app is too good and i am happy using it                                 happy
This is how it works                                                        -
Since the app crashes frequently, I advice you guys to fix the issue ASAP   crashes
Since the app crashes frequently, I advice you guys to fix the issue ASAP   vice
Since the app crashes frequently, I advice you guys to fix the issue ASAP   sin
Now apply the following filter to the data frame to remove the words that do not exactly match a word present in the sentence.
df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]
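The same filter can be sketched on a small example to show which rows survive. The column names `fil` and `text` follow the data frame above; `long_df`, `is_whole_word` and `filtered` are illustrative names, and `mapply` is used here instead of `apply` only to keep the two columns explicit.

```r
long_df <- data.frame(
  fil  = c("crashes", "vice", "sin"),
  text = rep("Since the app crashes frequently, I advice you guys to fix the issue ASAP", 3),
  stringsAsFactors = FALSE
)

# Keep a row only if its word equals one of the whitespace-separated tokens
# of its sentence (case-insensitive comparison).
is_whole_word <- mapply(
  function(word, sentence) tolower(word) %in% tolower(unlist(strsplit(sentence, "\\s+"))),
  long_df$fil, long_df$text
)
filtered <- long_df[is_whole_word, ]
print(filtered)
```

Because the sentence is split on whitespace only, a token keeps any attached punctuation ("frequently," rather than "frequently"), and a two-word term such as "far better" can never equal a single token. Stripping punctuation and handling bigrams separately may be needed for real data.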
Now my resulting data frame is as follows.
sentences                                                                   words
This document is far better                                                 better
This is a great app                                                         great
The night skies were sombre and starless                                    sombre
The app is too good and i am happy using it                                 good
The app is too good and i am happy using it                                 happy
This is how it works                                                        -
Since the app crashes frequently, I advice you guys to fix the issue ASAP   crashes
stri_detect_fixed reduced my computation time considerably. The rest of the process did not take much time. Thanks to @David for pointing me in the right direction.
You can do this with extract_sentiment_terms in the latest version of sentimentr, but you first have to create a sentiment key and assign values to the words:
pos <- c("far better","good","great","sombre","happy")
neg <- c('sin','vice','crashes')
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP',
"This document is far better", "This is a great app","The night skies were sombre and starless",
"The app is too good and i am happy using it", "This is how it works")
library(sentimentr)
(sentkey <- as_key(data.frame(c(pos, neg), c(rep(1, length(pos)), rep(-1, length(neg))), stringsAsFactors = FALSE)))
## x y
## 1: crashes -1
## 2: far better 1
## 3: good 1
## 4: great 1
## 5: happy 1
## 6: sin -1
## 7: sombre 1
## 8: vice -1
extract_sentiment_terms(sentences, sentkey)
##    element_id sentence_id negative   positive
## 1:          1           1  crashes
## 2:          2           1          far better
## 3:          3           1               great
## 4:          4           1              sombre
## 5:          5           1          good,happy
## 6:          6           1
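If a package dependency is not wanted, the same exact-match idea can be sketched in base R. This is an illustration under assumptions, not a benchmarked solution: `match_terms` is a hypothetical helper, terms are assumed to be one or two words long, and punctuation is stripped before tokenizing.

```r
words <- c("far better", "good", "great", "sombre", "happy", "sin", "vice", "crashes")
sentences <- c(
  "Since the app crashes frequently, I advice you guys to fix the issue ASAP",
  "This document is far better",
  "The app is too good and i am happy using it",
  "This is how it works"
)

# Hypothetical helper: exact matches of one- and two-word terms in a sentence.
match_terms <- function(sentence, terms) {
  clean <- gsub("[[:punct:]]", "", tolower(sentence))
  unigrams <- strsplit(clean, "\\s+")[[1]]
  unigrams <- unigrams[nzchar(unigrams)]
  n <- length(unigrams)
  # Adjacent token pairs, built with vectorized paste() instead of a loop.
  bigrams <- if (n > 1) paste(unigrams[-n], unigrams[-1]) else character(0)
  candidates <- c(unigrams, bigrams)
  hits <- unique(candidates[candidates %in% terms])
  if (length(hits) == 0) "-" else toString(hits)
}

result <- data.frame(
  sentences = sentences,
  words = vapply(sentences, match_terms, character(1), terms = words, USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(result)
```

Looking up each sentence's own tokens in the term list, rather than searching every term in every sentence, keeps the work per sentence proportional to its length, which may matter with 65k terms.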