R: Removing regex matches from a quanteda DFM (sparse document-feature matrix) object?
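The quanteda package provides the sparse document-feature matrix (DFM), and its methods include removeFeatures. I tried dfm(x, removeFeatures="\\b[a-z]{1-3}\\b")
to remove words that are too short, as well as dfm(x, keptFeatures="\\b[a-z]{4-99}\\b")
to keep words that are long enough. Neither works, although both are meant to do basically the same thing, namely remove words that are too short.
How can I remove regex matches from a quanteda DFM object?
Example: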
myMatrix <- dfm(myData, ignoredFeatures = stopwords("english"),
                stem = TRUE, toLower = TRUE, removeNumbers = TRUE,
                removePunct = TRUE, removeSeparators = TRUE, language = "english")

# How do I use keptFeatures/removeFeatures here?
# Instead of the removeFeatures/keptFeatures arguments, I tried the following, but it does not work:
x <- unique(gsub("\\b[a-zA-Z0-9]{1,3}\\b", "", colnames(myMatrix)))
x <- x[x != ""]
mmyMatrix <- myMatrix
colnames(mmyMatrix) <- x
Example DFM:
myData <- c("a aothu oat hoah huh huh huhhh h h h n", "hello h a b c d abc abcde", "hello hallo hei hej", "Hello my name is hhh.")
myMatrix <- dfm(myData)
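It's dfm_select, available in quanteda >= v0.9.9. Note that selection = "keep" retains the features matching the pattern (here, those of one to three characters); use selection = "remove" to drop them instead: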
myMatrix
## Document-feature matrix of: 4 documents, 22 features (70.5% sparse).
dfm_select(myMatrix, "\\b[a-zA-Z0-9]{1,3}\\b", selection = "keep", valuetype = "regex")
## kept 14 features, from 1 supplied (regex) feature types
## Document-feature matrix of: 4 documents, 14 features (71.4% sparse).
## 4 x 14 sparse Matrix of class "dfmSparse"
##        features
## docs    a oat huh h n b c d abc hei hej my is hhh
##   text1 1   1   2 3 1 0 0 0   0   0   0  0  0   0
##   text2 1   0   0 1 0 1 1 1   1   0   0  0  0   0
##   text3 0   0   0 0 0 0 0 0   0   1   1  0  0   0
##   text4 0   0   0 0 0 0 0 0   0   0   0  1  1   1
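To see which features the pattern matches, and therefore which ones would survive a removal instead of a keep, here is a minimal base R sketch. It does not call quanteda: the vector feats is a hypothetical stand-in for some of the feature names of myMatrix, and the other variable names are illustrative assumptions as well. It also shows the two details that break the patterns in the question: the word boundary has to be written as "\\b" inside an R string, and the quantifier takes a comma, {1,3}, not a hyphen.

```r
# Hypothetical stand-in for the feature names of myMatrix (a subset of the example texts).
feats <- c("a", "aothu", "oat", "hoah", "huh", "huhhh", "h", "n",
           "hello", "abc", "abcde", "my", "name", "is", "hhh")

# In an R string the regex word boundary is written "\\b"; a single "\b" is a backspace.
# The repetition range uses a comma: {1,3} means one to three characters.
short_pattern <- "\\b[a-zA-Z0-9]{1,3}\\b"

# TRUE for features that consist of a whole word of one to three characters.
is_short <- grepl(short_pattern, feats)

short_feats <- feats[is_short]   # roughly what selection = "keep" retains
long_feats  <- feats[!is_short]  # roughly what selection = "remove" retains

print(short_feats)
print(long_feats)
```

Under these assumptions, long_feats holds the features that a removal with the same pattern would leave in the matrix, which is what the question asks for.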