How do I keep intra-word periods in unigrams? R quanteda

I want to keep two-letter acronyms such as "t.v." and "u.s." that are delimited by periods in my unigram frequency table. When I build the unigram frequency table with quanteda, the trailing period gets stripped. Here is a small test corpus to illustrate; I have removed the periods that act as sentence delimiters:

SOS This is the u.s. where our politics is crazy EOS

SOS In the US we watch a lot of t.v. aka TV EOS

SOS TV is an important part of life in the US EOS

SOS folks outside the u.s. probably don't watch so much t.v. EOS

SOS living in other countries is probably not any less crazy EOS

SOS i enjoy my sanity when it comes to visit EOS

I load this into R as a character vector:

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")

Here is the code I use to build my unigram frequency table:

library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ",  toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable

This produces the following result:

       ngram frequency
1        SOS         6
2        EOS         6
3        the         4
4         is         3
5          .         3
6        u.s         2
7      crazy         2
8         US         2
9      watch         2
10        of         2
11       t.v         2
12        TV         2
13        in         2
14  probably         2
15      This         1
16     where         1
17       our         1
18  politics         1
19        In         1
20        we         1
21         a         1
22       lot         1
23       aka         1

etc...

I'd like to keep the terminal periods on t.v. and u.s., and to eliminate the entry in the table for . with a frequency of 3.

I also don't understand why the period (.) gets a count of 3 in this table while the u.s and t.v unigrams are counted correctly (2 each).

The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition of word boundaries (from the stringi package). u.s. therefore shows up as the token u.s followed by a separate period token . — great if your name is will.i.am, but perhaps not so great for your purposes. You can, however, easily switch to the whitespace tokeniser by passing the argument what = "fasterword" to tokens(), an option that is also available in dfm() through the ... part of the function call.
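For comparison, this is roughly what the default "word" tokeniser does with the first document; the trailing period is split off as its own token. (The output below is reconstructed from the frequency table above rather than copied from a session, so it may differ slightly across quanteda versions.)

tokens(acro.test[1], what = "word")[[1]]
##  [1] "SOS"   "This"  "is"   "the"   "u.s"   "."   "where"   "our"   "politics"   "is"   "crazy"   "EOS"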

tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS"      "This"     "is"       "the"      "u.s."     "where"    "our"      "politics" "is"       "crazy"    "EOS" 

As you can see, u.s. is preserved here. As for your last question, the terminal . has a document frequency of 3 because it appears as a separate token in three of the documents, which is what happens with the default tokeniser when remove_punct = FALSE.
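A quick way to verify this (my own small check, not part of the original code) is to ask which documents contain a standalone "." token under the default tokeniser:

sapply(as.list(tokens(acro.test)), function(x) "." %in% x)
## TRUE for the three documents containing u.s. and/or t.v., FALSE for the rest -
## which is exactly why docfreq() reports 3 for "."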

To pass this through to dfm() and then construct the data.frame of word document frequencies, the following code works (I've tidied it up a bit for efficiency). Note the comment about the difference between document and term frequency - I have noticed that some users get a bit confused by docfreq().

# I removed the options that were the same as the default 
# note also that stopwords = TRUE is not a valid argument - see remove parameter
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")

# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
#       not the same as docfreq
# dat.dfm <- sort(dat.dfm)

# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
                        row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
##    ngram frequency
## 1    SOS         6
## 2    EOS         6
## 3    the         4
## 4     is         3
## 5   u.s.         2
## 6  crazy         2
## 7     US         2
## 8  watch         2
## 9     of         2
## 10  t.v.         2
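
Since the comment above mentions the difference between document and term frequency, here is a small illustration (my own addition, assuming the dat.dfm built above) using the token "is":

# document frequency vs. total term frequency for the token "is"
docfreq(dat.dfm)["is"]    # 3: "is" occurs in three of the six documents
colSums(dat.dfm)["is"]    # 4: "is" occurs four times in total (twice in one document)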

In my view the named vector produced by calling docfreq() on the dfm is a more efficient way to store the results than the data.frame approach, but you may wish to add other variables.
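
If you do stay with the named vector, a sketch like this keeps it sorted by document frequency without building a data.frame (freqVec is just an illustrative name):

# keep the result as a named vector, sorted by decreasing document frequency
freqVec <- sort(docfreq(dat.dfm), decreasing = TRUE)
head(freqVec, 10)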