Calculate Frequency of Letter in String and TF-IDF
I have a named character vector called y that looks similar to this:
D1 D2 D3 D4 D5
"X D X " "G U V " "F Q " "A C U E" "H I T "
What I want to do with this vector is create frequency counts of the letters, along with IDF weights. I tried running this code:
dd <- Corpus(VectorSource(y)) #Make a corpus object from a text vector
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
However, when I run this code, I get this warning:
Warning message:
In weighting(x) : empty document(s): 1 2 3 4 5.
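All of the documents contain letters, or at least white space (which I would also like to include). I'm not sure what I'm doing wrong; I was able to get this example working: Different tf-idf values in R and hand calculation.
Using my example above, what I want is something like this: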
A C D E F G H I Q T U V X
0 0 1 0 0 0 0 0 0 0 0 0 2 - D1
0 0 0 0 0 1 0 0 0 0 1 1 0 - D2
...
Any help would be greatly appreciated!
You can do this in base R:
y <- c("X D X ", "G U V ", "F Q ", "A C U E", "H I T ")
names(y) <- paste0("D", 1:5)
# named vector of strings
y
D1 D2 D3 D4 D5
"X D X " "G U V " "F Q " "A C U E" "H I T "
# get rid of spaces, then split every letter
let <- sapply(y, function(x) strsplit(gsub(" ", "", x), ""))
# all possible letters
let_all <- unique(unlist(let))
# uses table on factored x with all possible levels
let_tab <- sapply(let, function(x) table(factor(x, levels=let_all)))
# with some cosmetics
t(let_tab[order(rownames(let_tab)), ])
A C D E F G H I Q T U V X
D1 0 0 1 0 0 0 0 0 0 0 0 0 2
D2 0 0 0 0 0 1 0 0 0 0 1 1 0
D3 0 0 0 0 1 0 0 0 1 0 0 0 0
D4 1 1 0 1 0 0 0 0 0 0 1 0 0
D5 0 0 0 0 0 0 1 1 0 1 0 0 0
Is this what you want? If so, here is a function that does all of it:
tabulate_letters <- function(y){
  let <- sapply(y, function(x) strsplit(gsub(" ", "", x), ""))
  # all possible letters
  let_all <- unique(unlist(let))
  # uses table on factored x with all possible levels
  let_tab <- sapply(let, function(x) table(factor(x, levels=let_all)))
  # with some cosmetics
  t(let_tab[order(rownames(let_tab)), ])
}
tabulate_letters(y)
A C D E F G H I Q T U V X
D1 0 0 1 0 0 0 0 0 0 0 0 0 2
D2 0 0 0 0 0 1 0 0 0 0 1 1 0
D3 0 0 0 0 1 0 0 0 1 0 0 0 0
D4 1 1 0 1 0 0 0 0 0 0 1 0 0
D5 0 0 0 0 0 0 1 1 0 1 0 0 0
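The table above holds raw counts only, while the question also asks for IDF weights. As a minimal sketch, assuming term frequency is the raw count and IDF is log10(N / document frequency), the weights can be derived from the same kind of count matrix in base R. The variable names below are illustrative and not part of the answer:

```r
# Same sample data as in the question
y <- c(D1 = "X D X ", D2 = "G U V ", D3 = "F Q ", D4 = "A C U E", D5 = "H I T ")

# Drop spaces, then split each string into single letters
letters_by_doc <- strsplit(gsub(" ", "", y), "")
all_letters <- sort(unique(unlist(letters_by_doc)))

# Documents in rows, letters in columns
counts <- t(sapply(letters_by_doc,
                   function(x) table(factor(x, levels = all_letters))))

# Assumed weighting: idf = log10(number of documents / documents containing the letter)
doc_freq <- colSums(counts > 0)
idf <- log10(nrow(counts) / doc_freq)

# Multiply every column of counts by that letter's idf
tf_idf <- sweep(counts, 2, idf, `*`)
round(tf_idf, 5)
```

Other TF-IDF variants (normalised term frequency, natural or base-2 logarithms) give different numbers, so the weighting should be chosen to match whatever downstream tool the values are compared against.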
We can also do this with mtabulate from qdapTools:
library(qdapTools)
mtabulate(strsplit(y, ' '))[-1]
# A C D E F G H I Q T U V X
#D1 0 0 1 0 0 0 0 0 0 0 0 0 2
#D2 0 0 0 0 0 1 0 0 0 0 1 1 0
#D3 0 0 0 0 1 0 0 0 1 0 0 0 0
#D4 1 1 0 1 0 0 0 0 0 0 1 0 0
#D5 0 0 0 0 0 0 1 1 0 1 0 0 0
We can use trimws to remove leading/trailing whitespace before calling strsplit:
mtabulate(strsplit(trimws(y), " "))
Data
y <- c("X D X ", "G U V ", "F Q ", "A C U E", "H I T ")
names(y) <- paste0("D", 1:5)
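If adding a package dependency is not desirable, a similar cross-tabulation can be sketched with base R's table(). This is only an illustration under the assumption that letters are separated by one or more spaces; the names tokens, doc and counts are made up for the example:

```r
y <- c("X D X ", "G U V ", "F Q ", "A C U E", "H I T ")
names(y) <- paste0("D", 1:5)

# Trim outer whitespace, then split on runs of spaces
tokens <- strsplit(trimws(y), " +")

# Repeat each document name once per token so both vectors line up
doc <- rep(names(tokens), lengths(tokens))

# Cross-tabulate document against letter
counts <- table(doc, letter = unlist(tokens))
counts
```

The result is a table object rather than a data frame; as.data.frame.matrix(counts) converts it if a data frame is preferred.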
There's an app for that: the quanteda package.
require(quanteda)
y <- c("X D X ", "G U V ", "F Q ", "A C U E", "H I T ")
dtm <- dfm(y, toLower = FALSE, verbose = FALSE)
# sort by letter, if that's important
dtm <- dtm[, sort(features(dtm))]
dtm
## Document-feature matrix of: 5 documents, 13 features.
## 5 x 13 sparse Matrix of class "dfmSparse"
## features
## docs A C D E F G H I Q T U V X
## text1 0 0 1 0 0 0 0 0 0 0 0 0 2
## text2 0 0 0 0 0 1 0 0 0 0 1 1 0
## text3 0 0 0 0 1 0 0 0 1 0 0 0 0
## text4 1 1 0 1 0 0 0 0 0 0 1 0 0
## text5 0 0 0 0 0 0 1 1 0 1 0 0 0
If you want tf-idf, that's easy too:
tfidf(dtm)
## Document-feature matrix of: 5 documents, 13 features.
## 5 x 13 sparse Matrix of class "dfmSparse"
## features
## docs A C D E F G H I Q T U V X
## text1 0 0 0.69897 0 0 0 0 0 0 0 0 0 1.39794
## text2 0 0 0 0 0 0.69897 0 0 0 0 0.39794 0.69897 0
## text3 0 0 0 0 0.69897 0 0 0 0.69897 0 0 0 0
## text4 0.69897 0.69897 0 0.69897 0 0 0 0 0 0 0.39794 0 0
## text5 0 0 0 0 0 0 0.69897 0.69897 0 0.69897 0 0 0
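The calls shown here come from an older quanteda release. In more recent versions several of them have been renamed, so a rough equivalent, assuming quanteda 3 or later is installed, might look like the sketch below. It relies on tokens(), dfm(), featnames() and dfm_tfidf(); the printed class and header will differ from the output shown in this answer:

```r
library(quanteda)

y <- c(D1 = "X D X ", D2 = "G U V ", D3 = "F Q ", D4 = "A C U E", D5 = "H I T ")

# Tokenise first; newer versions no longer build a dfm straight from text
toks <- tokens(y)

# Keep upper case so the letters stay as they appear in the input
dtm <- dfm(toks, tolower = FALSE)

# Sort columns by letter, if that's important
dtm <- dtm[, sort(featnames(dtm))]

# Default weighting: raw count times log10(N / document frequency)
weighted <- dfm_tfidf(dtm)
weighted
```

Because y is named here, the documents are labelled D1 to D5 rather than text1 to text5.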