How to forbid punctuation and whitespace inside ngrams?

Question

I have a character vector like this:

sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

I am using textcnt() to generate bigrams, as follows:

library(tau)  # textcnt() comes from the tau package
txt <- textcnt(sent, method = "string", split = " ", n = 2, tolower = FALSE)

format(txt) gives me all the bigrams:

              frq rank  bytes Encoding
Over the      1   4.5   8     unknown
The quick     2   11.5  9     unknown
brown fox     2   11.5  9     unknown
brown fox.    1   4.5   10    unknown
dog jumped    1   4.5   10    unknown
dog. Over     1   4.5   9     unknown
fox jumps     2   11.5  9     unknown
fox. The      1   4.5   8     unknown
jumped the    1   4.5   10    unknown
jumps over    2   11.5  10    unknown
lazy dog      1   4.5   8     unknown
lazy dog.     2   11.5  9     unknown
over the      2   11.5  8     unknown
quick brown   3   15.5  11    unknown
the lazy      3   15.5  8     unknown
the quick     1   4.5   9     unknown

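The real data has many more sentences. I have two questions:
1. Is it possible to specify that the period at the end of each sentence should be dropped from the generated ngrams?
2. Is it possible to prevent ngrams that span two sentences from being generated, such as "dog. Over" and "fox. The"?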

Answer 1

You can avoid particular ngrams in textcnt by avoiding textcnt. :-) To flesh out @lukeA's comment, here is a full quanteda solution.

require(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.19’

This tokenizes the text into bigrams while removing punctuation. Because each sentence is a "document", the bigrams will never span documents.

(bigramToks <- tokenize(sent, ngrams = 2, removePunct = TRUE, concatenator = " "))
## tokenizedText object from 3 documents.
## Component 1 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"   
## 
## Component 2 :
## [1] "Over the"    "the lazy"    "lazy dog"    "dog jumped"  "jumped the"  "the quick"   "quick brown" "brown fox"  
## 
## Component 3 :
## [1] "The quick"   "quick brown" "brown fox"   "fox jumps"   "jumps over"  "over the"    "the lazy"    "lazy dog"
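
For readers without quanteda at hand, the same idea can be sketched in base R. This is only an illustration under stated assumptions, not part of the original answer: `make_bigrams` is a hypothetical helper, each element of `sent` is assumed to hold exactly one sentence, and punctuation is stripped with the POSIX class `[[:punct:]]` (which also removes apostrophes, so "don't" would become "dont").

```r
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Hypothetical helper: bigrams for a single sentence, punctuation removed.
make_bigrams <- function(sentence) {
  clean <- gsub("[[:punct:]]", "", sentence)               # drop punctuation
  words <- strsplit(trimws(clean), "[[:space:]]+")[[1]]    # split on whitespace
  if (length(words) < 2) return(character(0))              # too short for a bigram
  paste(head(words, -1), tail(words, -1))                  # pair each word with its successor
}

# Apply the helper per sentence so that no bigram can span two sentences.
bigrams <- lapply(sent, make_bigrams)
bigrams[[2]]
```

Because the helper only ever sees one sentence at a time, cross-sentence pairs such as "dog. Over" cannot arise. If a single element held several sentences, it would have to be split into sentences first.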

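To get the frequencies of these, you should tabulate the bigram tokens by constructing a document-feature matrix with dfm(). (Note: you could skip the tokenization step and do this directly with dfm(sent, ngrams = 2, toLower = FALSE, concatenator = " ").)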

(bigramDfm <- dfm(bigramToks, toLower = FALSE, verbose = FALSE))
## Document-feature matrix of: 3 documents, 12 features.
## 3 x 12 sparse Matrix of class "dfmSparse"
##        features
## docs    The quick quick brown brown fox fox jumps jumps over over the the lazy lazy dog Over the dog jumped
##   text1         1           1         1         1          1        1        1        1        0          0
##   text2         0           1         1         0          0        0        1        1        1          1
##   text3         1           1         1         1          1        1        1        1        0          0
## features
## docs    jumped the the quick
##   text1          0         0
##   text2          1         1
##   text3          0         0

topfeatures(bigramDfm, n = nfeature(bigramDfm))
## quick brown   brown fox    the lazy    lazy dog   The quick   fox jumps  jumps over    over the    Over the 
##           3           3           3           3           2           2           2           2           1 
##  dog jumped  jumped the   the quick 
##           1           1           1
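
For intuition about what the document-feature matrix and the totals contain, here is a rough base-R analogue. It is a sketch under assumptions, not quanteda's implementation: the names `long`, `doc_feature` and `freq` are made up for illustration, the table is dense rather than sparse, and the document labels `text1` to `text3` are generated by hand to mirror the output above.

```r
sent <- c("The quick brown fox jumps over the lazy dog.",
          "Over the lazy dog jumped the quick brown fox.",
          "The quick brown fox jumps over the lazy dog.")

# Punctuation-free words per sentence, then within-sentence bigrams.
words   <- strsplit(gsub("[[:punct:]]", "", sent), " ", fixed = TRUE)
bigrams <- lapply(words, function(w) paste(head(w, -1), tail(w, -1)))

# Long format: one row per bigram occurrence, labelled by its document.
long <- data.frame(
  doc    = rep(paste0("text", seq_along(bigrams)), lengths(bigrams)),
  bigram = unlist(bigrams)
)

# Documents in rows, bigrams in columns: a dense analogue of the dfm.
doc_feature <- table(long$doc, long$bigram)

# Column sums are the overall bigram frequencies, most frequent first.
freq <- sort(colSums(doc_feature), decreasing = TRUE)
freq
```

Case is preserved here, as with toLower = FALSE, so "Over the" and "over the" are counted as separate bigrams.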
