How to forbid punctuation and whitespace inside ngrams?
I have a character vector like this:
sent <- c("The quick brown fox jumps over the lazy dog.",
"Over the lazy dog jumped the quick brown fox.",
"The quick brown fox jumps over the lazy dog.")
I am using textcnt() from the tau package to generate bigrams, like this:
library(tau)
txt <- textcnt(sent, method = "string", split = " ", n = 2, tolower = FALSE)
format(txt)
which gives me all of the bigrams:
            frq rank bytes Encoding
Over the      1  4.5     8  unknown
The quick     2 11.5     9  unknown
brown fox     2 11.5     9  unknown
brown fox.    1  4.5    10  unknown
dog jumped    1  4.5    10  unknown
dog. Over     1  4.5     9  unknown
fox jumps     2 11.5     9  unknown
fox. The      1  4.5     8  unknown
jumped the    1  4.5    10  unknown
jumps over    2 11.5    10  unknown
lazy dog      1  4.5     8  unknown
lazy dog.     2 11.5     9  unknown
over the      2 11.5     8  unknown
quick brown   3 15.5    11  unknown
the lazy      3 15.5     8  unknown
the quick     1  4.5     9  unknown
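The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be stripped from the generated ngrams?
2. Is it possible to prevent ngrams that span two sentences, such as "dog. Over" and "fox. The"?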
You can avoid specific ngrams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a full quanteda solution.
require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’
This tokenizes the text into bigrams while removing punctuation. Because each sentence is a "document", the bigrams will never span documents.
(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick" "quick brown" "brown fox" "fox jumps" "jumps over" "over the" "the lazy" "lazy dog"
##
## Component 2 :
## [1] "Over the" "the lazy" "lazy dog" "dog jumped" "jumped the" "the quick" "quick brown" "brown fox"
##
## Component 3 :
## [1] "The quick" "quick brown" "brown fox" "fox jumps" "jumps over" "over the" "the lazy" "lazy dog"
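The principle at work here (strip punctuation, then form bigrams inside each sentence separately) can also be sketched in base R, without any package. This is only an illustration of the idea, not what quanteda does internally; the names sentence_to_bigrams and sentence_bigrams are made up for this sketch.

```r
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Hypothetical helper: bigrams of a single sentence, punctuation removed.
sentence_to_bigrams <- function(s) {
  s <- gsub("[[:punct:]]+", "", s)            # drop periods, commas, etc.
  words <- strsplit(s, "[[:space:]]+")[[1]]   # split on runs of whitespace
  words <- words[nzchar(words)]               # discard empty strings
  if (length(words) < 2) return(character(0)) # too short to form a bigram
  paste(head(words, -1), tail(words, -1))     # pair each word with the next
}

# One call per sentence, so no bigram can cross a sentence boundary.
sentence_bigrams <- lapply(sent, sentence_to_bigrams)
```

Note that removing every punctuation character also affects apostrophes and hyphens inside words (for example "don't" becomes "dont"), so the pattern may need narrowing for real data.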
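To get the frequencies of these bigrams, you should tabulate the bigram tokens by constructing a document-feature matrix using dfm(). (Note: you could skip the tokenization step and do this directly using dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)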
(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
##         features
## docs    The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
##   text1         1           1         1         1          1        1        1        1        0          0
##   text2         0           1         1         0          0        0        1        1        1          1
##   text3         1           1         1         1          1        1        1        1        0          0
##         features
## docs    jumped the the quick
##   text1          0         0
##   text2          1         1
##   text3          0         0
topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown   brown fox    the lazy    lazy dog   The quick   fox jumps  jumps over    over the    Over the
##           3           3           3           3           2           2           2           2           1
##  dog jumped  jumped the   the quick
##           1           1           1
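The calls above match quanteda 0.9.5.19. Later releases changed the interface, so here is a rough equivalent, assuming quanteda 3.x or newer, where tokens() replaces tokenize(), n-grams are built with tokens_ngrams(), and the arguments are spelled remove_punct and tolower. The object names are illustrative, and the printed output format will differ from what is shown above.

```r
library(quanteda)  # assumption: quanteda 3.x or newer is installed

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Each element of `sent` is its own document, so bigrams stay within a sentence.
toks <- tokens(sent, remove_punct = TRUE)

# Form bigrams, joined by a space rather than the default underscore.
bigram_toks <- tokens_ngrams(toks, n = 2, concatenator = " ")

# Tabulate without lowercasing, so "The quick" and "the quick" stay distinct.
bigram_dfm <- dfm(bigram_toks, tolower = FALSE)

# Overall frequency of every bigram, most frequent first.
bigram_freq <- topfeatures(bigram_dfm, n = nfeat(bigram_dfm))
```

Here bigram_freq is a named numeric vector, so a single count can be looked up with, for example, bigram_freq[["quick brown"]].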