How to remove punctuation from tokens, when quanteda tokenizes at sentence level?
My ultimate goal is to select some sentences from a corpus that match a specific pattern and to run a sentiment analysis on those selected snippets. I am trying to do all of this with the current version of quanteda in R.
I noticed that when tokens() is applied at the sentence level (what = "sentence"), remove_punct does not remove punctuation. When the selected sentence-tokens are later split into word-tokens for the sentiment analysis, the word-tokens still carry punctuation such as "," or ".". A dictionary will then no longer match these tokens. Reproducible example:
mypattern <- c("country", "honor")
#
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.",
         blind <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.")
#
toks <- tokens_select(tokens(txt, what = "sentence", remove_punct = TRUE),
                      pattern = paste0(mypattern, collapse = "|"),
                      valuetype = "regex",
                      selection = "keep")
#
toks
The tokens in toks then contain, for example, "citizens," or "arrive," with the punctuation still attached. I would like to split the tokens back into word-tokens via tokens_split(toks, separator = " "), but separator only allows a single input argument.
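(An aside with a possible workaround, my sketch rather than anything from the original post: after splitting on spaces, pure-punctuation tokens can be dropped and edge punctuation stripped from the remaining token types via tokens_replace(); this assumes the stringi package is available.)

library("stringi")
toks_words <- tokens_split(toks, separator = " ")
# drop tokens that are nothing but punctuation, e.g. "," or "."
toks_words <- tokens_remove(toks_words, pattern = "^\\p{P}+$", valuetype = "regex")
# strip punctuation stuck to the word edges, e.g. "citizens," -> "citizens"
toks_clean <- tokens_replace(toks_words,
                             pattern = types(toks_words),
                             replacement = stri_replace_all_regex(types(toks_words),
                                                                  "^\\p{P}+|\\p{P}+$", ""),
                             valuetype = "fixed")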
Is it possible to remove the punctuation inside the sentences when tokenizing at the sentence level?
There is a better way to accomplish your goal, which is to perform the sentiment analysis only on the sentences from documents containing your target pattern. You can do this by first reshaping your corpus into sentences, then tokenizing them, and then using tokens_select() with the window argument to select only those documents that contain the pattern. Here, you set the window so large that it is guaranteed to include the entire sentence.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.
When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.
Lorem ipsum dolor sit amet.")
corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_sent
#> Corpus consisting of 3 documents.
#> text1.1 :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
#> text1.2 :
#> "When the occasion proper for it shall arrive, I shall endeav..."
#>
#> text1.3 :
#> "Lorem ipsum dolor sit amet."
# sentiment on just the documents with the pattern
mypattern <- c("country", "honor")
toks <- tokens(corp_sent) %>%
  tokens_select(pattern = mypattern, window = 10000000)
toks
#> Tokens consisting of 3 documents.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 11 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 12 more ]
#>
#> text1.3 :
#> character(0)
# now perform sentiment analysis on the selected tokens
tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
dfm()
#> Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs negative positive neg_positive neg_negative
#> text1.1 0 0 0 0
#> text1.2 0 5 0 0
#> text1.3 0 0 0 0
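If a single score per sentence is wanted, one possible follow-up (my sketch, not part of the original answer) is to convert the counts to a data frame and take positive minus negative:

dfmat <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
  dfm()
# net sentiment per sentence: positive minus negative dictionary hits
df <- convert(dfmat, to = "data.frame")
df$net_sentiment <- df$positive - df$negative
df[, c("doc_id", "positive", "negative", "net_sentiment")]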
Created on 2022-03-22 by the reprex package (v2.0.1)
Note that if you want to exclude the empty sentences, just use dfm_subset(dfmat, nfeat(dfmat) > 0), where dfmat is the saved dfm output from your sentiment analysis.
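(One caveat from me, not from the original answer: nfeat() returns a single corpus-wide feature count rather than one value per document, so a per-document filter may be what is needed here. A sketch, assuming the toks and dfmat objects from above; note that filtering on the sentiment dfm itself would also drop sentences that had tokens but no dictionary matches, such as text1.1:)

# keep only the sentences that still contained tokens after tokens_select()
dfmat_filtered <- dfm_subset(dfmat, ntoken(toks) > 0)
dfmat_filtered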