Split up ngrams in document-feature matrix (quanteda)
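I would like to know whether it is possible to split up ngram features in a document-feature matrix (dfm) so that, for example, a bigram results in two separate unigrams?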
head(dfm, n = 3, nfeature = 4)
docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1
So, the dfm above would turn into something like this:
head(dfm, n = 3, nfeature = 4)
docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1
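For better understanding: I got the ngrams in the dfm by translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").
Thanks in advance!
EDIT: The following can be used as a reproducible example.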
library(quanteda)
eg.txt <- c('increase in_the great plenary',
            'great plenary emission_reduction',
            'increase in_the emission_reduction emission_increase')
eg.corp <- corpus(eg.txt)
eg.dfm <- dfm(eg.corp)
head(eg.dfm)
I don't know if this is the best approach (it may use a lot of RAM, since it converts the sparse dfm into a data.frame/matrix), but it should work:
# turn the dfm into a matrix (transposing it)
DF <- as.data.frame(eg.dfm)
MX <- t(DF)
# split the current column names by '_'
colsSplit <- strsplit(colnames(DF),'_')
# replicate the rows of the matrix and give them the new split row names
MX <- MX[rep(seq_along(colsSplit), lengths(colsSplit)), ]
rownames(MX) <- unlist(colsSplit)
# aggregate the matrix rows having the same name and transpose again
MX2 <- t(do.call(rbind,by(MX,rownames(MX),colSums)))
# turn the matrix into a dfm
eg.dfm.res <- as.dfm(MX2)
Result:
> eg.dfm.res
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1
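If the RAM concern mentioned above matters, the same split-and-sum can probably be done without leaving sparse format. The following is only a sketch under assumptions: it uses the Matrix package directly, and `counts` is a hand-built sparse matrix standing in for `eg.dfm` (the names `counts`, `parts`, `unigrams`, `map` and `res` are made up for illustration). The idea is to build a small sparse feature-to-unigram map and multiply once.

```r
library(Matrix)

# Hand-built stand-in for eg.dfm: documents in rows, features in columns.
feats <- c("increase", "in_the", "great", "plenary",
           "emission_reduction", "emission_increase")
counts <- Matrix(c(1, 1, 1, 1, 0, 0,
                   0, 0, 1, 1, 1, 0,
                   1, 1, 0, 0, 1, 1),
                 nrow = 3, ncol = 6, byrow = TRUE, sparse = TRUE,
                 dimnames = list(paste0("text", 1:3), feats))

# Split every feature name on "_" and collect the distinct unigrams.
parts    <- strsplit(colnames(counts), "_", fixed = TRUE)
unigrams <- sort(unique(unlist(parts)))

# Sparse map: entry (i, j) is how often unigram j occurs in feature i.
map <- sparseMatrix(i = rep(seq_along(parts), lengths(parts)),
                    j = match(unlist(parts), unigrams),
                    x = 1,
                    dims = c(length(parts), length(unigrams)),
                    dimnames = list(colnames(counts), unigrams))

# One sparse product redistributes each ngram count to its unigrams
# and sums counts that land on the same unigram.
res <- counts %*% map
print(res)
```

On this toy input `res` has the same counts as `eg.dfm.res` above. With a real dfm in place of `counts`, `as.dfm(res)` (as used in the answer) should turn the product back into a dfm, assuming the dfm can be multiplied like an ordinary Matrix sparse matrix; that part is not verified here.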