quanteda collocations and lemmatization
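I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation, and I quote:
"A tokens object . . . While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized."
This makes perfect sense, so here goes: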
library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)
# Some sample data and lemmas
df= c("this column has a lot of missing data, 50% almost!",
"I am interested in missing data problems",
"missing data is a headache",
"how do you handle missing data?")
lemmas <- data.frame() %>%
rbind(c("missing", "miss")) %>%
rbind(c("data", "datum")) %>%
`colnames<-`(c("inflected_form", "lemma"))
(1) Generate collocations using the corpus object:
txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
(2) Preprocess the text, identify collocations, and lemmatize for downstream tasks.
# I used a blank space as concatenator and the phrase function as explained in the documentation and I followed the multi multi substitution example in the documentation
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE,
remove_symbols = TRUE, remove_separators = TRUE) %>%
tokens_tolower() %>%
tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
(3) Test the results
# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)
# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
rownames_to_column(var="feature") %>%
`colnames<-`(c("feature", "count"))
dfm_feat
feature | count |
---|---|
this | 1 |
column | 1 |
has | 1 |
a | 2 |
lot | 1 |
of | 1 |
almost | 1 |
i | 2 |
am | 1 |
interested | 1 |
in | 1 |
problems | 1 |
is | 1 |
headache | 1 |
how | 1 |
do | 1 |
you | 1 |
handle | 1 |
missing data | 4 |
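"missing data" should be "miss datum".
The replacement only works when each document in df is a single word. I can make the process work if I generate my collocations from a tokens object from the start, but that is not what I want.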
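The problem is that you have already compounded the elements of the collocation into a single "token" containing a space, but by wrapping the patterns in phrase(), you are telling tokens_replace() to look for sequences of separate tokens, not for the single token that contains a space.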
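To make the mismatch concrete, here is a minimal sketch on a single sentence from the question. It assumes a recent quanteda version; the object names toks and replaced are only for illustration. After compounding, the collocation is one type that happens to contain a space, so a word-level pattern such as "missing" no longer matches anything.

```r
library(quanteda)

toks <- tokens("missing data is a headache")
toks <- tokens_compound(toks, pattern = phrase("missing data"), concatenator = " ")

# The collocation is now a single type containing a space
print(types(toks))

# A word-level pattern does not match the compounded token,
# so the tokens come back unchanged
replaced <- tokens_replace(toks, pattern = "missing", replacement = "miss")
print(identical(as.list(replaced), as.list(toks)))
```

This is why the feature table in the question still shows "missing data": the word-level lemma table never sees a token it recognizes.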
The way to get what you want is to make the lemmatized replacement match the compounded collocation.
phrase_lemmas <- data.frame(
inflected_form = "missing data",
lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "almost"
##
## text2 :
## [1] "i" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
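If there are many collocations, writing the phrase-level table by hand gets tedious. One possible way to derive it from the word-level lemma table is sketched below. Note that lemmatize_phrase() is a hypothetical helper written for this example, not a quanteda function, and the collocations vector is a stand-in for myPhrases$collocation.

```r
library(quanteda)

# Word-level lemma table, as in the question
lemmas <- data.frame(
  inflected_form = c("missing", "data"),
  lemma = c("miss", "datum"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of quanteda): lemmatize each word of a
# space-joined collocation, leaving words without a lemma entry unchanged
lemmatize_phrase <- function(x, lemma_table) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    idx <- match(words, lemma_table$inflected_form)
    paste(ifelse(is.na(idx), words, lemma_table$lemma[idx]), collapse = " ")
  }, character(1))
}

# Stand-in for myPhrases$collocation
collocations <- c("missing data")
phrase_lemmas <- data.frame(
  inflected_form = collocations,
  lemma = lemmatize_phrase(collocations, lemmas),
  stringsAsFactors = FALSE
)

toks <- tokens(c("missing data is a headache",
                 "how do you handle missing data?"),
               remove_punct = TRUE)
toks <- tokens_compound(toks, pattern = phrase(collocations), concatenator = " ")
toks <- tokens_replace(toks,
                       pattern = phrase_lemmas$inflected_form,
                       replacement = phrase_lemmas$lemma)
print(toks)
```

The helper uses an exact, case-sensitive match on each word, so it assumes the collocations have already been lowercased the same way as the lemma table.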
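An alternative is to use tokens_lookup() directly on the uncompounded tokens, if you have a fixed list of sequences that you want to map to lemmatized sequences. For example,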
tokens(txtCorpus) %>%
tokens_lookup(dictionary(list("miss datum" = "missing data")),
exclusive = FALSE, capkeys = FALSE
)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this" "column" "has" "a" "lot"
## [6] "of" "miss datum" "," "50" "%"
## [11] "almost" "!"
##
## text2 :
## [1] "I" "am" "interested" "in" "miss datum"
## [6] "problems"
##
## text3 :
## [1] "miss datum" "is" "a" "headache"
##
## text4 :
## [1] "how" "do" "you" "handle" "miss datum"
## [6] "?"
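With more than one sequence, the dictionary can be built from a two-column table rather than typed out. The following is a sketch under the assumption that the table has the same inflected_form and lemma columns used above; split() groups the inflected sequences by their lemma, which gives the named list that dictionary() expects.

```r
library(quanteda)

# Phrase-level table: one row per sequence to lemmatize
phrase_lemmas <- data.frame(
  inflected_form = "missing data",
  lemma = "miss datum",
  stringsAsFactors = FALSE
)

# Keys are the lemmatized sequences, values are the sequences to match
lemma_dict <- dictionary(split(phrase_lemmas$inflected_form, phrase_lemmas$lemma))

toks <- tokens("how do you handle missing data?")
toks_lemmatized <- tokens_lookup(toks, lemma_dict,
                                 exclusive = FALSE, capkeys = FALSE)
print(toks_lemmatized)
```

Because the lookup runs on uncompounded tokens, no tokens_compound() step is needed with this approach.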