How to forbid punctuation and whitespace inside ngrams?

I have a character vector like this:

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

I am using textcnt() to generate bigrams, as follows:

library(tau)  # textcnt() comes from the tau package
txt <- textcnt(sent, method = "string", split = " ", n = 2, tolower = FALSE)

format(txt) gives me all the bigrams:

              frq rank  bytes Encoding
Over the      1   4.5   8     unknown
The quick     2   11.5  9     unknown
brown fox     2   11.5  9     unknown
brown fox.    1   4.5   10    unknown
dog jumped    1   4.5   10    unknown
dog. Over     1   4.5   9     unknown
fox jumps     2   11.5  9     unknown
fox. The      1   4.5   8     unknown
jumped the    1   4.5   10    unknown
jumps over    2   11.5  10    unknown
lazy dog      1   4.5   8     unknown
lazy dog.     2   11.5  9     unknown
over the      2   11.5  8     unknown
quick brown   3   15.5  11    unknown
the lazy      3   15.5  8     unknown
the quick     1   4.5   9     unknown  

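The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be stripped from the generated ngrams?
2. Is it possible to prevent ngrams that span two sentences, such as "dog. Over" and "fox. The"?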

You can avoid particular ngrams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a full quanteda solution.

require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’

This tokenizes the text into bigrams while removing punctuation. Because each sentence is its own "document", a bigram can never span two documents.

(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
## 
## Component 2 :
## [1] "Over the"    "the lazy"    "lazy dog"    "dog jumped"  "jumped the"  "the quick"   "quick brown" "brown fox"  
## 
## Component 3 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
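
If you would rather not depend on any package, the same idea (strip punctuation, then form bigrams one sentence at a time) can be sketched in base R. This is only an illustration: sentence_bigrams() is a hypothetical helper name, not something from tau or quanteda, and the punctuation rule is deliberately crude (it would also strip apostrophes and hyphens inside words).

```r
# Hypothetical helper (not part of tau or quanteda): bigrams for ONE sentence.
sentence_bigrams <- function(sentence) {
  # Drop all punctuation, then split on runs of whitespace.
  words <- strsplit(gsub("[[:punct:]]+", "", sentence), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  # Pair every word with its successor.
  paste(head(words, -1), tail(words, -1))
}

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Each sentence is processed separately, so no bigram can cross a boundary.
bigrams <- lapply(sent, sentence_bigrams)
```

Because lapply() handles each element of sent on its own, a pair like "dog Over" cannot be produced.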

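To get the frequencies, tabulate the bigram tokens by constructing a document-feature matrix with dfm(). (Note: you could skip the tokenization step and do this directly with dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)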

(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
##        features
## docs    The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
##   text1         1           1         1         1          1        1        1        1        0          0
##   text2         0           1         1         0          0        0        1        1        1          1
##   text3         1           1         1         1          1        1        1        1        0          0
##        features
## docs    jumped the the quick
##   text1          0         0
##   text2          1         1
##   text3          0         0

topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown   brown fox    the lazy    lazy dog   The quick   fox jumps  jumps over    over the    Over the 
##           3           3           3           3           2           2           2           2           1 
##  dog jumped  jumped the   the quick 
##           1           1           1
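
The calls above are from quanteda 0.9.5.19. In later releases the function and argument names changed, so here is a rough equivalent, assuming a recent quanteda (3.x or later) is installed. I have not re-run it against every release, so treat it as a sketch rather than a guaranteed drop-in replacement.

```r
library(quanteda)  # assumption: quanteda 3.x or later

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Tokenize each sentence as its own document, dropping punctuation.
toks <- tokens(sent, remove_punct = TRUE)

# Form bigrams within each document, joined by a space instead of "_".
bigramToks <- tokens_ngrams(toks, n = 2, concatenator = " ")

# Build the document-feature matrix without lowercasing the features.
bigramDfm <- dfm(bigramToks, tolower = FALSE)

# Named vector of bigram counts, most frequent first.
freq <- topfeatures(bigramDfm, n = nfeat(bigramDfm))
```

The main renames are tokenize() to tokens(), removePunct to remove_punct, toLower to tolower, and nfeature() to nfeat(); bigrams are now built in a separate tokens_ngrams() step.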