In R library(tm), how do I get the NGRAMS output with an underscore?
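Below is my code, in which I create bigrams from text data. The output is fine, except that I need the field names to contain underscores so that I can use them as variables in a model.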
text <- c("Since I love to travel, this is what I rely on every time.",
          "I got the rewards card for the no international transaction fee",
          "I got the rewards card mainly for the flight perks",
          "Very good card, easy application process, and no international transaction fee",
          "The customer service is outstanding!",
          "My wife got the rewards card for the gift cards and international transaction fee.She loves it")
df <- data.frame(text)

library(tm)

# Build the corpus and clean it up
corpus <- Corpus(DataframeSource(df))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Tokenizer that returns bigrams, with the two words joined by a space
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

dtm <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
sparse <- removeSparseTerms(dtm, .80)
dtm2 <- as.matrix(sparse)
dtm2
The output is as follows:
    Terms
Docs got rewards international transaction rewards card transaction fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
How can I make the field names look like got_rewards instead of got rewards?
I guess this is not really a tm-specific question. Anyway, you can set collapse="_" in your code, or modify the column names after the fact, like so:
colnames(dtm2) <- gsub(" ", "_", colnames(dtm2), fixed = TRUE)
dtm2
    Terms
Docs got_rewards international_transaction rewards_card transaction_fee
   1           0                         0            0               0
   2           1                         1            1               1
   3           1                         0            1               0
   4           0                         1            0               1
   5           0                         0            0               0
   6           1                         1            1               0
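
The other option mentioned above, setting collapse="_" inside the tokenizer itself, could look roughly like the sketch below. The tokenizer is the one from the question with only the collapse argument changed; the `toks` word vector is a made-up input used purely to show what the joined bigrams look like, and is not taken from the corpus.

```r
library(tm)  # attaches NLP as well, which provides ngrams() and words()

# Same tokenizer as in the question, except collapse = "_" instead of " "
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = "_"), use.names = FALSE)

# Quick check of the joining step on a plain word vector (hypothetical input)
toks <- c("got", "rewards", "card")
bigrams <- unlist(lapply(ngrams(toks, 2), paste, collapse = "_"), use.names = FALSE)
print(bigrams)
```

With this tokenizer the terms should already contain underscores when the matrix is built, so no renaming step is needed afterwards. One caveat, stated as an assumption about your setup: in more recent tm versions a corpus created with Corpus() may be a SimpleCorpus, for which custom tokenizers are not applied, so you may need VCorpus() for the tokenizer to take effect.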