In R library(tm), how do I get the n-gram output with underscores?

Question

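Below is my code, in which I create bigrams from text data. The output I get is fine, except that I need the field names to contain underscores so that I can use them as variables in a model.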

text<- c("Since I love to travel, this is what I rely on every time.", 
        "I got the rewards card for the no international transaction fee", 
        "I got the rewards card mainly for the flight perks",
        "Very good card, easy application process, and no international transaction fee",
        "The customer service is outstanding!",
        "My wife got the rewards card for the gift cards and international transaction fee.She loves it") 
df<- data.frame(text) 


library(tm)
corpus<- Corpus(DataframeSource(df))
corpus<- tm_map(corpus, content_transformer(tolower))
corpus<- tm_map(corpus, removePunctuation)
corpus<- tm_map(corpus, removeWords, stopwords("english"))
corpus<- tm_map(corpus, stripWhitespace)


BigramTokenizer<-
  function(x)
    unlist(lapply(ngrams(words(x),2),paste,collapse=" "),use.names=FALSE)

dtm<- DocumentTermMatrix(corpus, control= list(tokenize= BigramTokenizer))

sparse<- removeSparseTerms(dtm,.80)
dtm2<- as.matrix(sparse)
dtm2

The output is as follows:

    Terms
Docs got rewards international transaction rewards card transaction fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0

How can I make the field names look like got_rewards instead of got rewards?

Answer 1

I guess this is not really a tm-specific question. Anyway, you can either set collapse="_" in your tokenizer or modify the column names after the fact, like so:

colnames(dtm2) <- gsub(" ", "_", colnames(dtm2), fixed = TRUE)
dtm2
    Terms
Docs got_rewards international_transaction rewards_card transaction_fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
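
For the first option, a minimal sketch of the tokenizer with collapse = "_" might look like the following. The two-document vector `docs` is a hypothetical cut-down input, and the use of VCorpus is an assumption on my part: as far as I know, recent tm versions ignore custom tokenizers on the SimpleCorpus objects that Corpus() can return.

```r
library(tm)  # tm attaches NLP, which provides ngrams() and words()

# Same tokenizer as in the question, but each word pair is joined with "_"
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)

# Hypothetical sample input (two of the sentences from the question)
docs <- c("I got the rewards card for the no international transaction fee",
          "I got the rewards card mainly for the flight perks")

# Assumption: VCorpus is used so that the custom tokenizer is honoured
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
terms <- Terms(dtm)
print(terms)
```

With this approach the terms already contain underscores when the matrix is built, so no renaming step is needed afterwards. The gsub() route above is still handy if the matrix already exists.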
