Stemming each word

I want to stem each individual word. For example, "hardworking employees" should be converted to "hardwork employe" rather than "hardworking employee"; in short, the two words should be stemmed separately. I know this may not look meaningful, but it is just an example. In practice I have some medical vocabulary where this kind of stemming makes sense.

I have a function that splits on the ',' delimiter and then performs stemming. I would like to modify it so that stemming is applied to every word within each ','-delimited chunk.

dt = read.table(header = TRUE, 
text ="Word Synonyms
employee 'hardworking employees, intelligent employees, employment, employee'
lover  'loved ones, loving boy, lover'
", stringsAsFactors= F)

library(SnowballC)
library(parallel)

stem_text3 <- function(text, language = "english", mc.cores = 3) {
  stem_string <- function(str, language) {
    str <- strsplit(x = str, split = ",")  # "\," is an invalid escape in R; split on a literal comma
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = ",")
    return(str)
  }

  # stem each text block in turn
  x <- mclapply(X = text, FUN = stem_string, language = language, mc.cores = mc.cores)

  # return stemmed text blocks
  return(unlist(x))
}

df000 <- data.frame(stringsAsFactors = FALSE)
for (i in 1:nrow(dt)) {
  sent <- dt[i, "Synonyms"]
  # wordStem() expects a language name from getStemLanguages(), e.g. "english", not "en"
  k <- data.frame(r_synonyms = stem_text3(sent, language = "english"), stringsAsFactors = FALSE)
  df000 <- rbind(df000, k)
}

This is tricky because SnowballC::wordStem() stems each element of a character vector, so your character vector needs to be split up and then re-combined in order for this to work.
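A minimal illustration of that element-wise behaviour: a pre-split vector is stemmed word by word, whereas a multi-word string is fed to the stemmer as a single token, so its individual words are not stemmed properly.

```r
library(SnowballC)

# Each element of the vector is treated as one token and stemmed:
wordStem(c("hardworking", "employees"), language = "english")
# [1] "hardwork" "employe"

# A multi-word element is NOT split for you; the whole phrase is
# handed to the Porter stemmer as if it were one "word":
wordStem("hardworking employees", language = "english")
```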

I would ditch the loop and vectorise this using an apply operation (you could swap it for mclapply() if you wish):

library("stringi")
dt[["Synonyms"]] <- 
    sapply(stri_split_fixed(dt[["Synonyms"]], ","), function(x) {
        x <- lapply(stri_split_fixed(stri_trim_both(x), " "), function(y) {
            paste(SnowballC::wordStem(y), collapse = " ")
        })
        paste(x, collapse = ", ")
    })

dt
##       Word                                            Synonyms
## 1 employee hardwork employe, intellig employe, employ, employe
## 2    lover                            love on, love boi, lover
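If the data are large, the same per-word logic can be moved into parallel::mclapply(), as suggested above. A sketch, assuming a Unix-like system where mclapply() can fork (on Windows it only works with mc.cores = 1); the wrapper name stem_synonyms is mine, not from the original code:

```r
library("stringi")
library("parallel")
library("SnowballC")

# Hypothetical wrapper: stem every word inside each comma-delimited synonym list.
# mc.cores = 1 keeps it portable; raise it on Unix-like systems.
stem_synonyms <- function(x, mc.cores = 1) {
  out <- mclapply(stri_split_fixed(x, ","), function(syns) {
    stemmed <- lapply(stri_split_fixed(stri_trim_both(syns), " "), function(words) {
      paste(wordStem(words, language = "english"), collapse = " ")
    })
    paste(stemmed, collapse = ", ")
  }, mc.cores = mc.cores)
  unlist(out)
}

stem_synonyms("hardworking employees, intelligent employees, employment, employee")
# [1] "hardwork employe, intellig employe, employ, employe"
```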

Notes:

First, I suspect this is not the stemming you expect, but that is how the Porter stemmer implemented in SnowballC works.

Second, there are better ways to structure this problem overall, but I cannot really address that unless you explain your objective in the question. To replace a set of phrases (using wildcards instead of stemming), for instance, in quanteda you could do the following:

library("quanteda")
thedict <- dictionary(list(
    employee = c("hardwork* employ*", "intellig* employ*", "employment", "employee*"),
    lover = c("lov* ones", "lov* boy", "lover*")
))

tokens("Some employees are hardworking employees in useful employment.  
        They support loved osuch as their wives and lovers.") %>%
    tokens_lookup(dictionary = thedict, exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "Some"     "employee" "are"      "employee" "in"       "useful"   "employee"
##  [8] "."        "They"     "support"  "loved"    "osuch"    "as"       "their"   
## [15] "wives"    "and"      "lover"    "."