How to apply regex in the Quanteda package in R to remove consecutively repeated tokens (words)
I am currently working on a text mining project, and after running my ngrams model I realized I have sequences of repeated words. I would like to remove the repeated words while keeping their first occurrence. The code below demonstrates what I intend to do. Thanks!
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
textfun <- corpus(textfun)
textfuntoks <- tokens(textfun)
textfunRef <- tokens_replace(textfuntoks, pattern = **?**, replacement = **?**, valuetype = "regex")
The desired output is "This analysis should remove all of the duplicated or repeated words and return only their first occurrence". I am only interested in consecutive repetitions.
My main problem is coming up with values for the "pattern" and "replacement" arguments of the "tokens_replace" function. I have tried different patterns, some adapted from resources on here, but none seems to work. An image illustrating the problem is included. [5-gram frequency distribution showing instances of words such as "swag", "pleas", "gas", "books", "chicago", "happi"]
You can split the data into individual words, use rle to find runs of consecutive values, and paste the first value of each run back together.
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
paste0(rle(tolower(strsplit(textfun, '\\s+')[[1]]))$values, collapse = ' ')
#[1] "this analysis should remove all of the duplicated or repeated words and return only their first occurrence"
有趣的挑战。要在 quanteda 中执行此操作,您可以创建一个字典,将每个重复序列映射到它的单次出现。
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus("This this this this will analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence")
toks <- tokens(corp)
ngrams <- tokens_tolower(toks) %>%
  tokens_ngrams(n = 5:2, concatenator = " ") %>%
  as.character()
# choose only the ngrams that are all the same word
ngrams <- ngrams[lengths(sapply(strsplit(ngrams, split = " "), unique, simplify = TRUE)) == 1]
# remove duplicates
ngrams <- unique(ngrams)
head(ngrams, n = 3)
## [1] "all all all all all" "return return return return return"
## [3] "this this this this"
So this gives a vector of all of the (lowercased) repeated values. (To avoid the lowercasing, remove the tokens_tolower() line, as sketched below.)
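A minimal sketch of that case-sensitive variant (ngrams_cs is an illustrative name); note that without lowercasing, a run such as "This this this this" mixes two distinct tokens and would no longer be matched in full:
ngrams_cs <- tokens_ngrams(toks, n = 5:2, concatenator = " ") %>%
  as.character()
# keep only the ngrams whose tokens are identical, including case
ngrams_cs <- unique(ngrams_cs[lengths(sapply(strsplit(ngrams_cs, split = " "), unique, simplify = TRUE)) == 1])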
Now we create a dictionary in which each sequence is a "value" and each unique token is its "key". The list from which dict is built will contain multiple identical keys, but the dictionary() constructor combines them automatically. Once created, tokens_lookup() can then convert each sequence into a single token.
dict <- dictionary(
  structure(
    # this causes each ngram to be treated as a single "value"
    as.list(ngrams),
    # each dictionary key will be the unique token
    names = sapply(ngrams, function(x) strsplit(x, split = " ")[[1]][1], simplify = TRUE, USE.NAMES = FALSE)
  )
)
# convert the sequences to their keys
toks2 <- tokens_lookup(toks, dict, exclusive = FALSE, nested_scope = "dictionary", capkeys = FALSE)
print(toks2, max_ntoken = -1)
## Tokens consisting of 1 document.
## text1 :
## [1] "this" "will" "analysis" "should" "remove"
## [6] "all" "of" "the" "duplicated" "or"
## [11] "repeated" "words" "and" "return" "only"
## [16] "their" "first" "occurrence"
Created on 2021-04-08 by the reprex package (v1.0.0)
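As a quick check that the result matches the target sentence, the tokens can be collapsed back into a single string (the name result is illustrative; the output follows from the printed tokens above):
result <- paste(as.character(toks2), collapse = " ")
result
## [1] "this will analysis should remove all of the duplicated or repeated words and return only their first occurrence"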