How to remove punctuation from tokens, when quanteda tokenizes at sentence level?
My ultimate goal is to select some sentences from a corpus that match a specific pattern and to run a sentiment analysis on those selected snippets. I am trying to do all of this with the current version of quanteda in R.
I noticed that when tokens() is applied at the sentence level (what = "sentence"), remove_punct does not remove punctuation. When the selected sentence-tokens are later split into word-tokens for the sentiment analysis, the word-tokens still carry punctuation such as "," or ".". A dictionary will then no longer match these tokens. Reproducible example:
mypattern <- c("country", "honor")
#
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.",
         blind <- "Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed diam nonummy nibh euismod tincidunt ut laoreet dolore magna aliquam erat volutpat.")
#
toks <- tokens_select(tokens(txt, what = "sentence", remove_punct = TRUE),
                      pattern = paste0(mypattern, collapse = "|"),
                      valuetype = "regex",
                      selection = "keep")
#
toks
The tokens in toks then contain, for example, "citizens," or "arrive," with the punctuation still attached. I would like to split the tokens back into word-tokens via tokens_split(toks, separator = " "), but separator only allows a single input argument.
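(An aside with a possible workaround, my sketch rather than anything from the original post: after splitting on spaces, pure-punctuation tokens can be dropped and edge punctuation stripped from the remaining token types via tokens_replace(); this assumes the stringi package is available.)

library("stringi")
toks_words <- tokens_split(toks, separator = " ")
# drop tokens that are nothing but punctuation, e.g. "," or "."
toks_words <- tokens_remove(toks_words, pattern = "^\\p{P}+$", valuetype = "regex")
# strip punctuation stuck to the word edges, e.g. "citizens," -> "citizens"
toks_clean <- tokens_replace(toks_words,
                             pattern = types(toks_words),
                             replacement = stri_replace_all_regex(types(toks_words),
                                                                  "^\\p{P}+|\\p{P}+$", ""),
                             valuetype = "fixed")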
Is it possible to remove the punctuation inside the sentences when tokenizing at the sentence level?
There is a better way to accomplish your goal, which is to perform the sentiment analysis only on the sentences from documents containing your target pattern. You can do this by first reshaping your corpus into sentences, then tokenizing them, and then using tokens_select() with the window argument to select only those documents that contain the pattern. Here, you set the window so large that it is guaranteed to include the entire sentence.
library("quanteda")
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 67.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c("Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.
When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.
Lorem ipsum dolor sit amet.")
corp <- corpus(txt)
corp_sent <- corpus_reshape(corp, to = "sentences")
corp_sent
#> Corpus consisting of 3 documents.
#> text1.1 :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
#> text1.2 :
#> "When the occasion proper for it shall arrive, I shall endeav..."
#>
#> text1.3 :
#> "Lorem ipsum dolor sit amet."
# sentiment on just the documents with the pattern
mypattern <- c("country", "honor")
toks <- tokens(corp_sent) %>%
  tokens_select(pattern = mypattern, window = 10000000)
toks
#> Tokens consisting of 3 documents.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 11 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 12 more ]
#>
#> text1.3 :
#> character(0)
# now perform sentiment analysis on the selected tokens
tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
dfm()
#> Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 0 docvars.
#> features
#> docs negative positive neg_positive neg_negative
#> text1.1 0 0 0 0
#> text1.2 0 5 0 0
#> text1.3 0 0 0 0
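If a single score per sentence is wanted, one possible follow-up (my sketch, not part of the original answer) is to convert the counts to a data frame and take positive minus negative:

dfmat <- tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
  dfm()
# net sentiment per sentence: positive minus negative dictionary hits
df <- convert(dfmat, to = "data.frame")
df$net_sentiment <- df$positive - df$negative
df[, c("doc_id", "positive", "negative", "net_sentiment")]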
Created on 2022-03-22 by the reprex package (v2.0.1)
Note that if you want to exclude the empty sentences, just use dfm_subset(dfmat, nfeat(dfmat) > 0), where dfmat is the saved dfm output from your sentiment analysis.
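(One caveat from me, not from the original answer: nfeat() returns a single corpus-wide feature count rather than one value per document, so a per-document filter may be what is needed here. A sketch, assuming the toks and dfmat objects from above; note that filtering on the sentiment dfm itself would also drop sentences that had tokens but no dictionary matches, such as text1.1:)

# keep only the sentences that still contained tokens after tokens_select()
dfmat_filtered <- dfm_subset(dfmat, ntoken(toks) > 0)
dfmat_filtered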