How to split out a numeric vector of bigrams from a TDM matrix
I have a large named numeric vector in R (46201 elements, 3.3 Mb).
tdm_pairs.matrix <- as.matrix(tdm_pairs)
top_pairs <- colSums(tdm_pairs.matrix)
head(sort(top_pairs, decreasing = TRUE), 6)
 i know  i dont i think   i can  i just  i want
     46      42      41      31      30      28
I tried this to split each one, but it returns only the counts:
unlist(strsplit(as.character(top_pairs)," "))
"46" "42" "41" "31" "30" "28"
I would like to split each of these apart, so the output would look something like:
"i" "know" "46"
"i" "dont" "42"
Something like this?
> top_pairs <- structure(c(46, 42), .Names = c("i know", "i dont"))
> do.call(rbind, strsplit(paste(names(top_pairs), top_pairs), " "))
     [,1] [,2]   [,3]
[1,] "i"  "know" "46"
[2,] "i"  "dont" "42"
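For what it's worth, the attempt in the question returns only the counts because as.character() converts the values of the vector and drops its names, so the bigram text never reaches strsplit(). A minimal sketch using the same two-element example vector (the variable names values_only, labels_only and words are made up for illustration):

```r
# Same toy vector as above: counts are the values, bigrams are the names.
top_pairs <- structure(c(46, 42), .Names = c("i know", "i dont"))

# as.character() converts the values and discards the names attribute.
values_only <- as.character(top_pairs)

# The bigram text lives in names(), so that is what needs splitting.
labels_only <- names(top_pairs)
words <- strsplit(labels_only, " ", fixed = TRUE)
```

That is why the answer above pastes names(top_pairs) and the counts together before splitting.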
Or, if you want to keep the counts numeric, you can convert to a data frame and use tidyr:
> library(magrittr)
> library(tidyr)
> data.frame(names=names(top_pairs), count=top_pairs) %>%
      separate(names, into=c("name1", "name2"), sep=" ") %>%
      set_rownames(NULL)
  name1 name2 count
1     i  know    46
2     i  dont    42
Since your data is large, you may want to use stringi:
library(stringi)
data.frame(stri_split_fixed(names(top_pairs), " ", simplify=TRUE),
           count=top_pairs, row.names=seq_along(top_pairs))
#   X1   X2 count
# 1  i know    46
# 2  i dont    42
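If you would rather not load any extra packages, the same result can probably be had in base R. This is only a sketch under a few assumptions: every name contains exactly two words separated by a single space, and the names word1, word2, sorted_pairs and result are made up here. The three-element vector is a cut-down version of the counts shown in the question.

```r
# Toy stand-in for the real vector, deliberately out of order.
top_pairs <- c("i know" = 46, "i think" = 41, "i dont" = 42)

# Sort first so the most frequent bigrams end up at the top.
sorted_pairs <- sort(top_pairs, decreasing = TRUE)

# Split each name on a literal space and stack the pieces into a matrix.
# Assumes every name holds exactly two words.
words <- do.call(rbind, strsplit(names(sorted_pairs), " ", fixed = TRUE))

# Build the data frame; unname() keeps the bigrams out of the row names.
result <- data.frame(
  word1 = words[, 1],
  word2 = words[, 2],
  count = unname(sorted_pairs),
  stringsAsFactors = FALSE
)
```

The count column stays numeric here, so head(result, n) gives the top n bigrams directly.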