Stemming each word
I want to stem each word. For example, "hardworking employees" should be converted to "hardwork employee", not "hardworking employee"; in short, both words should be stemmed separately. I know this example doesn't make much sense, but it is only an example; in practice I have some medical vocabulary for which this kind of stemming is meaningful.
I have a function that splits on the ',' delimiter and then performs stemming. I would like to modify it so that stemming is applied to every word inside each ','-delimited phrase.
dt = read.table(header = TRUE,
                text = "Word Synonyms
employee 'hardworking employees, intelligent employees, employment, employee'
lover 'loved ones, loving boy, lover'
", stringsAsFactors = FALSE)
library(SnowballC)
library(parallel)
stem_text3 <- function(text, language = "english", mc.cores = 3) {
  stem_string <- function(str, language) {
    # split on commas, stem each piece, then rejoin
    str <- strsplit(x = str, split = ",")
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = ",")
    return(str)
  }
  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language = language,
                mc.cores = mc.cores)
  # return stemmed text blocks
  return(unlist(x))
}
df000 <- data.frame(stringsAsFactors = FALSE)
for (i in 1:nrow(dt)) {
  sent = dt[i, "Synonyms"]
  k = data.frame(r_synonyms = stem_text3(sent, language = "english"),
                 stringsAsFactors = FALSE)
  df000 = rbind(df000, k)
}
This is tricky, because SnowballC::wordStem() stems each element of a character vector, so your character vector needs to be split apart and recombined for this to work.
I would drop the loop and vectorise this with an apply operation (which you could swap out for mclapply()).
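To see the element-wise behaviour described above, a minimal sketch (the stemmed forms agree with the output shown further down):

```r
library(SnowballC)

# wordStem() stems each element of the vector independently:
wordStem(c("hardworking", "employees"))
## [1] "hardwork" "employe"

# A multi-word string is a single element, so each phrase must be
# split into words before stemming and pasted back together after.
```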
library("stringi")
dt[["Synonyms"]] <-
sapply(stri_split_fixed(dt[["Synonyms"]], ","), function(x) {
x <- lapply(stri_split_fixed(stri_trim_both(x), " "), function(y) {
paste(SnowballC::wordStem(y), collapse = " ")
})
paste(x, collapse = ", ")
})
dt
## Word Synonyms
## 1 employee hardwork employe, intellig employe, employ, employe
## 2 lover love on, love boi, lover
Notes:
First, I don't think this is the stemming you were expecting, but that is how the Porter stemmer as implemented in SnowballC works.
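The "boi" and "on" forms in the output above are ordinary Porter behaviour (a final "y" after a vowel becomes "i", and a trailing "e" is dropped from short stems); a quick check:

```r
library(SnowballC)

# Porter turns a final "y" after a vowel into "i" and strips
# plural/participle endings, which yields the forms seen above:
wordStem(c("boy", "ones", "loving"))
## [1] "boi"  "on"   "love"
```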
Second, there are better ways to structure this problem overall, but I can't really answer that without knowing your objective. To replace a set of phrases (using wildcard matching instead of stemming), for example, in quanteda you could do the following:
library("quanteda")
thedict <- dictionary(list(
employee = c("hardwork* employ*", "intellig* employ*", "employment", "employee*"),
lover = c("lov* ones", "lov* boy", "lover*")
))
tokens("Some employees are hardworking employees in useful employment.
They support loved osuch as their wives and lovers.") %>%
tokens_lookup(dictionary = thedict, exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Some" "employee" "are" "employee" "in" "useful" "employee"
## [8] "." "They" "support" "loved" "osuch" "as" "their"
## [15] "wives" "and" "lover" "."