Why does featnames(myDFM) contain features of more than one or two tokens?
I'm working with a large corpus of 1M documents, and I apply several transformations when creating a document-feature matrix from it:
library(quanteda)
corpus_dfm <- dfm(tokens(corpus1M),  # where corpus1M is already a corpus via quanteda::corpus()
                  remove = stopwords("english"),
                  # what = "word",  # experimented with whether adding this made a difference
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)
Then looking at the resulting features:
dimnames(corpus_dfm)$features
[1] "abandon"
[2] "abandoned auto"
[3] "abandoned vehicl"
...
[8] "accident hit and run"
...
[60] "assault no weapon aggravated injuri"
Why are these features longer than the 1:2 ngrams I asked for? Stemming appears to have been applied successfully, but the tokens look like whole phrases rather than single words.
I tried adjusting my code to dfm(tokens(corpus1M, what = "word"), ...), but nothing changed.
I tried to put together a small reproducible example:
library(tidyverse)  # just for the pipe here
example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs") %>% corpus()
Then, if I apply the same dfm call as above:
> dimnames(corpus_dfm)$features
[1] "etc."
That's strange, because almost all of the words have been removed? Even the stopwords are gone, unlike before, so I'm even more confused!
Despite trying, I can't even produce a reproducible example of the original problem now. Maybe I'm misunderstanding how this function works?
How can I create a dfm in quanteda with only 1:2-word tokens and with stopwords removed?
First question: why are the features (names) in the dfm so long?
Answer: Because applying a dictionary inside the dfm() call replaces matches to your unigram and bigram features with the dictionary keys, and (many of) the keys in your dictionary consist of more than one word. Example:
lut_dict[70:72]
# Dictionary object with 3 key entries.
# - assault felony:
# - asf
# - assault misdemeanor:
# - asm
# - assault no weapon aggravated injury:
# - anai
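To see this key-for-value substitution in isolation, here is a minimal sketch with a made-up dictionary (toy_dict, its key, and the example sentence are invented for illustration; dfm_lookup() performs the same kind of key-for-value replacement that the dictionary argument applies):

toy_dict <- dictionary(list("abandoned vehicle report" = c("abandon*", "vehicl*")))
toy_dfm  <- dfm(tokens("An abandoned vehicle was found near the lot."))
dfm_lookup(toy_dfm, dictionary = toy_dict)
# the result has a single feature named "abandoned vehicle report":
# the multi-word *key* becomes the feature name, which is why featnames()
# can contain strings longer than any 1:2 ngram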
Second question: in the reproducible example, why did nearly all of the words disappear?
Answer: Because the only match between your dictionary's values and the features in the dfm was for the "etc." key.
corpus_dfm2 <- dfm(tokens(example_text),
                   remove = stopwords("english"),
                   remove_punct = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE,
                   dictionary = lut_dict,
                   ngrams = 1:2,
                   stem = TRUE, verbose = TRUE)
corpus_dfm2
# Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
# 3 x 1 sparse Matrix of class "dfmSparse"
# features
# docs etc.
# text1 0
# text2 0
# text3 1
lut_dict["etc."]
# Dictionary object with 1 key entry.
# - etc.:
# - etc
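One quick way to see why so little survives is to compare the dictionary values directly against the unigram/bigram features (a rough diagnostic sketch; it assumes lut_dict is loaded and that unlist() flattens its values, and note that ngram features use "_" as the joiner while the dictionary values use spaces):

toks  <- tokens(example_text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks  <- tokens_remove(toks, stopwords("english"))
feats <- featnames(dfm(tokens_ngrams(toks, n = 1:2)))
intersect(tolower(unlist(lut_dict)), gsub("_", " ", feats))
# for this example text, only "etc" should overlap, so "etc." is the only key
# that can show up in the dfm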
If you do not apply the dictionary, then you see this:
dfm(tokens(example_text),  # the tokens() call is not necessary here
    remove = stopwords("english"),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE,
    ngrams = 1:2,
    stem = TRUE)
# Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
# 3 x 18 sparse Matrix of class "dfmSparse"
# features
# docs quick brown fox the_quick quick_brown brown_fox like carrot i_like
# text1 1 1 1 1 1 1 0 0 0
# text2 0 0 0 0 0 0 1 1 1
# text3 0 0 0 0 0 0 0 0 0
# features
# docs like_carrot etc cat dog the_there there_that that_etc etc_cat cat_dog
# text1 0 0 0 0 0 0 0 0 0
# text2 1 0 0 0 0 0 0 0 0
# text3 0 1 1 1 1 1 1 1 1
If you want to keep the features that did not match, replace dictionary with thesaurus. Below, you will see that the "etc" token has been replaced by the upper-cased key "ETC.":
dfm(tokens(example_text),
    remove = stopwords("english"),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE,
    thesaurus = lut_dict,
    ngrams = 1:2,
    stem = TRUE)
Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
3 x 18 sparse Matrix of class "dfmSparse"
features
docs quick brown fox the_quick quick_brown brown_fox like carrot i_like
text1 1 1 1 1 1 1 0 0 0
text2 0 0 0 0 0 0 1 1 1
text3 0 0 0 0 0 0 0 0 0
features
docs like_carrot cat dog the_there there_that that_etc etc_cat cat_dog ETC.
text1 0 0 0 0 0 0 0 0 0
text2 1 0 0 0 0 0 0 0 0
text3 0 1 1 1 1 1 1 1 1
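For reference, here is a sketch of how the same lookup can be written with the step-by-step tokens_* verbs in more recent quanteda releases (the functions named below are real quanteda functions, but treating this as an exact equivalent of the all-in-one dfm() call above is an assumption):

toks <- tokens(corpus1M, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)
# tokens_lookup() matches multi-word dictionary values as token sequences;
# exclusive = TRUE keeps only the keys (dictionary-style),
# exclusive = FALSE also keeps unmatched tokens (thesaurus-style)
toks_keys  <- tokens_lookup(toks, dictionary = lut_dict, exclusive = TRUE)
corpus_dfm <- dfm(toks_keys)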