Why does the ngrams() function give distinct bigrams?

I'm writing an R script and using the ngram library.

Suppose I have the string

"good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

and I want to find its bigrams.

The ngram library gives me the following bigrams:

"appreci product" "process meat" "food product" "food bought" "qualiti dog" "product found" "product look" "look like" "like stew" "good qualiti" "labrador finicki" "bought sever" "qualiti product" "better labrador" "dog food" "smell better" "vital can" "meat smell" "found good" "sever vital" "stew process" "can dog" "finicki appreci" "product better"

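Since the sentence contains "dog food" twice, I want this bigram twice, but I only get it once!

Is there an option in the ngram library, or in any other library, that gives me all the bigrams of my sentence in R?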

You can use the stylo package. It keeps the duplicates:

library(stylo)
txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
words <- txt.to.words(txt)                      # split the string into words
bigrams <- make.ngrams(words, ngram.size = 2)   # duplicates are kept
print(bigrams)

Result:

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"        "can dog"          "dog food"        
[10] "food product"     "product found"    "found good"       "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  

The development version of ngram has a get.phrasetable method:

devtools::install_github("wrathematics/ngram")
library(ngram)

text <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

ng <- ngram(text)
head(get.phrasetable(ng))
#            ngrams freq       prop
# 1    good qualiti    2 0.07692308
# 2        dog food    2 0.07692308
# 3 appreci product    1 0.03846154
# 4    process meat    1 0.03846154
# 5    food product    1 0.03846154
# 6     food bought    1 0.03846154
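If you need a flat list in which each bigram is repeated as often as it occurs, the freq column can drive rep(). The sketch below is only an illustration: it rebuilds the six rows shown above by hand so that it runs without the ngram package, and `pt` and `expanded` are made-up names. Note that the result follows the order of the table, not the order of the words in the text.

```r
# First six rows of the phrase table shown above, copied by hand
# (an assumption for illustration; normally this would be get.phrasetable(ng)).
pt <- data.frame(
  ngrams = c("good qualiti", "dog food", "appreci product",
             "process meat", "food product", "food bought"),
  freq = c(2, 2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Repeat every bigram according to its frequency.
# trimws() guards against stray whitespace around the phrases.
expanded <- rep(trimws(pt$ngrams), times = pt$freq)
expanded
```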

In addition, you can use the print() method and specify output = "full", i.e.:

print(ng, output = "full")

# NOTE: more output not shown...
better labrador | 1 
finicki {1} | 

dog food | 2 
product {1} | bought {1} 
# NOTE: more output not shown...

You can use RWeka. In the result you can see that "dog food" and "good qualiti" each appear twice:

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"


library(RWeka)
RWEKABigramTokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 2, max = 2)) 
}

RWEKABigramTokenizer(txt)

 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"      "vital can"       
 [8] "can dog"          "dog food"         "food product"     "product found"    "found good"       "good qualiti"     "qualiti product" 
[15] "product look"     "look like"        "like stew"        "stew process"     "process meat"     "meat smell"       "smell better"    
[22] "better labrador"  "labrador finicki" "finicki appreci"  "appreci product"  "product better"  

Or use the tm package in combination with RWeka:

library(tm)
library(RWeka)
my_corp <- Corpus(VectorSource(txt))
tdm_RWEKA <- TermDocumentMatrix(my_corp, control=list(tokenize = RWEKABigramTokenizer))

#show the 2 bigrams
findFreqTerms(tdm_RWEKA, lowfreq = 2)

[1] "dog food"     "good qualiti"

#turn into matrix with frequency counts
tdm_matrix <- as.matrix(tdm_RWEKA)

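You don't need any special package to produce bigrams like these. Basically, split the text into words and paste neighbouring words back together.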

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"
ug <- strsplit(txt, " ")[[1]]
# Pair each word with its successor. Dropping the last word on the left and the
# first word on the right keeps both vectors the same length, so paste() does not recycle.
bg <- paste(ug[-length(ug)], ug[-1])

The resulting bg is:

 [1] "good qualiti"     "qualiti dog"      "dog food"
 [4] "food bought"      "bought sever"     "sever vital"
 [7] "vital can"        "can dog"          "dog food"
[10] "food product"     "product found"    "found good"
[13] "good qualiti"     "qualiti product"  "product look"
[16] "look like"        "like stew"        "stew process"
[19] "process meat"     "meat smell"       "smell better"
[22] "better labrador"  "labrador finicki" "finicki appreci"
[25] "appreci product"  "product better"
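The same split-and-paste idea extends to any n and to frequency counts. The following is a minimal base-R sketch of that; the helper name `make_ngrams` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: build all word n-grams, keeping duplicates and text order.
make_ngrams <- function(text, n = 2) {
  words <- strsplit(text, " +")[[1]]
  words <- words[nzchar(words)]            # drop empty strings from leading spaces
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1) # start index of every n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

txt <- "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better"

bigrams <- make_ngrams(txt, 2)

# Count how often each bigram occurs, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
head(bigram_counts, 2)
```

Calling make_ngrams(txt, 3) returns trigrams in the same way.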

Try the quanteda package:

> quanteda::tokenize(txt, ngrams = 2, concatenator = " ")
[[1]]
 [1] "good qualiti"     "qualiti dog"      "dog food"         "food bought"      "bought sever"     "sever vital"     
 [7] "vital can"        "can dog"          "dog food"         "food product"     "product found"    "found good"      
[13] "good qualiti"     "qualiti product"  "product look"     "look like"        "like stew"        "stew process"    
[19] "process meat"     "meat smell"       "smell better"     "better labrador"  "labrador finicki" "finicki appreci" 
[25] "appreci product"  "product better"  

Plenty of additional options are available through the ngrams argument, including combinations of different n sizes and skip-grams.
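To make the term skip-gram concrete, here is a small base-R sketch that pairs each word with the word `skip + 1` positions later. It only illustrates the idea and is not quanteda's implementation; the function name `skip_bigrams` is hypothetical.

```r
# Hypothetical helper: two-word skip-grams with a fixed number of skipped words.
skip_bigrams <- function(words, skip = 0) {
  gap <- skip + 1                      # distance between the two paired words
  n_pairs <- length(words) - gap
  if (n_pairs < 1) return(character(0))
  left <- seq_len(n_pairs)             # positions of the first word in each pair
  paste(words[left], words[left + gap])
}

words <- strsplit("good qualiti dog food bought", " ")[[1]]

skip_bigrams(words, skip = 0)  # adjacent pairs, i.e. ordinary bigrams
skip_bigrams(words, skip = 1)  # pairs with one word skipped in between
```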