How to split a text into two meaningful words in R

This is the text in my data frame df, which has a text column named 'problem_note_text':

SSCIssue: Note Dispenser Failureperformed checks / dispensor failure / asked the stores to take the note dispensor out and set it back / still error message says front door is open / hence CE attn reqContact details - Olivia taber 01159063390 / 7am-11pm

library(stringr)  # str_replace_all
library(tm)       # removeNumbers, removeWords, stopwords
library(qdap)     # all_words

df$problem_note_text <- tolower(df$problem_note_text)
df$problem_note_text <- tm::removeNumbers(df$problem_note_text)
df$problem_note_text <- str_replace_all(df$problem_note_text, "  ", " ") # replace double spaces with single space
df$problem_note_text <- str_replace_all(df$problem_note_text, pattern = "[[:punct:]]", " ")
df$problem_note_text <- tm::removeWords(x = df$problem_note_text, stopwords(kind = 'english'))
Words <- all_words(df$problem_note_text, begins.with = NULL)

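I now have a data frame containing a list of words, but it includes run-together words such as

"Failureperformed"

which need to be split into two meaningful words, for example

"Failure" "performed".

How can I do this? The data frame also contains words such as

"im" , "h"

which are meaningless and must be removed. I don't know how to achieve this.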

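Given a list of English words, you can do this quite simply by trying every possible split of the word and checking whether both halves are in the list. I'll use the first Google hit I found for a word list, which contains about 70k lowercase words: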

wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1

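# Try every split point of x and keep the splits whose two halves are both in wl
check.word <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  # Column y holds the split after character y: c(left part, right part)
  parts <- sapply(1:(nc-1), function(y) c(substr(x, 1, y), substr(x, y+1, nc)))
  # Keep only the columns where both parts are known words
  parts[,parts[1,] %in% wl & parts[2,] %in% wl]
}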

This sometimes works:

check.word("screenunable", wl)
# [1] "screen" "unable"
check.word("nowhere", wl)
#      [,1]    [,2]  
# [1,] "no"    "now" 
# [2,] "where" "here"

But it also sometimes fails, when one of the relevant words is not in the word list (in this case "sensor" is missing):

check.word("sensoradvise", wl)
#     
# [1,]
# [2,]
"sensor" %in% wl
# [1] FALSE
"advise" %in% wl
# [1] TRUE
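
One possible way around the missing-word problem is to extend the word list with domain-specific terms before splitting. The same lookup can then also be used to drop tokens such as "im" and "h" that are neither known words nor splittable. The following is only a minimal sketch under stated assumptions: wl_demo is a tiny stand-in for the downloaded wl so the example runs offline, domain_terms is a hypothetical vocabulary you would supply yourself, and split_word is a hypothetical variant of check.word that always returns a 2-row matrix (with zero columns when nothing matches).

```r
# Tiny stand-in word list (assumption); in practice use the downloaded `wl`
wl_demo <- c("advise", "screen", "unable", "failure", "performed")

# Hypothetical domain vocabulary missing from the general word list
domain_terms <- c("sensor", "dispenser")
wl_ext <- union(wl_demo, domain_terms)

# Variant of check.word that always returns a 2 x k matrix,
# one column per valid split (k may be 0)
split_word <- function(x, wl) {
  x <- tolower(x)
  nc <- nchar(x)
  if (nc < 2) return(matrix(character(0), nrow = 2))
  cut <- seq_len(nc - 1)
  left <- substring(x, 1, cut)        # all prefixes
  right <- substring(x, cut + 1, nc)  # the matching suffixes
  keep <- left %in% wl & right %in% wl
  matrix(c(left[keep], right[keep]), nrow = 2, byrow = TRUE)
}

# Keep known words, split run-together words, drop everything else
tokens <- c("failureperformed", "sensoradvise", "im", "h", "screen")
clean <- unlist(lapply(tokens, function(tok) {
  if (tok %in% wl_ext) return(tok)
  parts <- split_word(tok, wl_ext)
  if (ncol(parts) > 0) return(parts[, 1])  # first split if there are several
  character(0)                              # unknown and unsplittable: drop
}))
clean
# [1] "failure"   "performed" "sensor"    "advise"    "screen"
```

Taking the first split when several are valid (as with "nowhere") is an arbitrary choice made for this sketch. With a real word list you may want to prefer, say, the split with the longest first part, or review ambiguous cases by hand.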