quanteda collocations and lemmatization

I am using the quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation, and I quote:

"tokens object . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."

This makes perfect sense, so here goes:

library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)

# Some sample data and lemmas
df <- c("this column has a lot of missing data, 50% almost!",
        "I am interested in missing data problems",
        "missing data is a headache",
        "how do you handle missing data?")

lemmas <- data.frame(inflected_form = c("missing", "data"),
                     lemma = c("miss", "datum"))

(1) Generate collocations using the corpus object:

txtCorpus <- corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases <- textstat_collocations(txtCorpus, tolower = FALSE)

(2) Preprocess the text, identify the collocations, and lemmatize for downstream tasks.

# I used a blank space as the concatenator and the phrase() function, following
# the multi-word substitution example in the documentation:
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE, 
                               remove_symbols = TRUE, remove_separators = TRUE) %>%
    tokens_tolower() %>%
    tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
    tokens_replace(pattern = phrase(lemmas$inflected_form), replacement = phrase(lemmas$lemma))

(3) Test the results

# Create dtm
dtm <- dfm(txtTokens, remove_padding = TRUE)

# Pull features
dfm_feat <- as.data.frame(featfreq(dtm)) %>%
    rownames_to_column(var = "feature") %>%
    `colnames<-`(c("feature", "count"))

dfm_feat
feature count
this 1
column 1
has 1
a 2
lot 1
of 1
almost 1
i 2
am 1
interested 1
in 1
problems 1
is 1
headache 1
how 1
do 1
you 1
handle 1
missing data 4

"missing data" should be "miss datum".

This only works when the element to be replaced is a single word. I can make the process work if I generate my collocations from a tokens object from the start, but that is not what I want.

The problem is that you have compounded the elements of the collocation into a single "token" that contains a space, but by supplying the phrase() wrapper in tokens_replace(), you are telling tokens_replace() to look for two sequential tokens rather than the single token containing a space.

The way to get what you want is to make the lemmatised replacement match the collocation:

phrase_lemmas <- data.frame(
  inflected_form = "missing data",
  lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this"       "column"     "has"        "a"          "lot"       
## [6] "of"         "miss datum" "almost"    
## 
## text2 :
## [1] "i"          "am"         "interested" "in"         "miss datum"
## [6] "problems"  
## 
## text3 :
## [1] "miss datum" "is"         "a"          "headache"  
## 
## text4 :
## [1] "how"        "do"         "you"        "handle"     "miss datum"
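With a longer lemma table, writing phrase_lemmas by hand gets tedious. Assuming each collocation should be lemmatised word by word, the table can be derived from the single-word lemmas with a small base-R helper (lemmatize_phrase below is a hypothetical name, not a quanteda function):

```r
# Hypothetical helper: lemmatize each word of a space-joined collocation
# using the single-word lemma table, leaving unknown words unchanged.
lemmatize_phrase <- function(phrases, inflected, lemma) {
  vapply(strsplit(phrases, " ", fixed = TRUE), function(words) {
    hits <- match(words, inflected)
    words[!is.na(hits)] <- lemma[hits[!is.na(hits)]]
    paste(words, collapse = " ")
  }, character(1))
}

lemmas <- data.frame(inflected_form = c("missing", "data"),
                     lemma = c("miss", "datum"))

phrase_lemmas <- data.frame(
  inflected_form = "missing data",
  lemma = lemmatize_phrase("missing data",
                           lemmas$inflected_form, lemmas$lemma)
)
phrase_lemmas$lemma
# [1] "miss datum"
```

Feeding the resulting phrase_lemmas into tokens_replace() as above then lemmatises every compounded collocation in one pass.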

An alternative is to use tokens_lookup() directly on the uncompounded tokens, if you have a fixed list of sequences that you want to match to lemmatised sequences. For example,

tokens(txtCorpus) %>%
  tokens_lookup(dictionary(list("miss datum" = "missing data")),
    exclusive = FALSE, capkeys = FALSE
  )
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
##  [1] "this"       "column"     "has"        "a"          "lot"       
##  [6] "of"         "miss datum" ","          "50"         "%"         
## [11] "almost"     "!"         
## 
## text2 :
## [1] "I"          "am"         "interested" "in"         "miss datum"
## [6] "problems"  
## 
## text3 :
## [1] "miss datum" "is"         "a"          "headache"  
## 
## text4 :
## [1] "how"        "do"         "you"        "handle"     "miss datum"
## [6] "?"