How do I find the location of tokens in a quanteda token object?
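I created a quanteda tokens object from a plain text file and selected the specific words I want to work with using
tokens_select(truePdfAnnualReports.toks, unlist(strategicKeywords.list), padding = TRUE)
which preserves the sequence of those tokens as they appear in the original text file. I now want to assign token position numbers (absolute and relative) to the tokens selected by the function. How can I do that?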
You want kwic(), not tokens_select(). I created a reproducible example answer below using the built-in data_corpus_inaugural.
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
toks <- tokens(tail(data_corpus_inaugural, 10))
keywords <- c("nuclear", "security")
# form a data.frame from kwic() results
kw <- kwic(toks, keywords, window = 0) %>%
  as.data.frame()
# for illustration
kw[10:14, ]
##         docname from   to pre  keyword post  pattern
## 10  1985-Reagan 2385 2385     security      security
## 11    1989-Bush 2149 2149     security      security
## 12 1997-Clinton  259  259     security      security
## 13 1997-Clinton 1660 1660      nuclear       nuclear
## 14    2001-Bush  872  872     Security      security
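For comparison, the padded tokens_select() result from the question already carries absolute positions implicitly, because padding = TRUE leaves an empty string wherever a token was dropped. The following is only a minimal sketch under that assumption, using a made-up one-sentence document (d1) rather than the question's data; the names toks_toy, sel and positions are illustrative and not part of the original answer.

```r
library("quanteda")

# Hypothetical toy document, not the data from the question
toks_toy <- tokens(c(d1 = "We value security and fear nuclear war"))

# padding = TRUE keeps an empty-string pad in place of every dropped token,
# so the kept tokens stay at their original index
sel <- tokens_select(toks_toy, c("nuclear", "security"), padding = TRUE)

# absolute positions are the indices of the non-empty tokens in each document
positions <- lapply(as.list(sel), function(x) which(nzchar(x)))
positions
```

kwic() is still the more convenient route, since it returns docname, from and to in a single data.frame that is ready for joining.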
Now, to get the relative positions, we can do a bit of dplyr magic after getting the total token length of each document and dividing:
doc_lengths <- data.frame(
  docname = docnames(toks),
  toklength = ntoken(toks)
)
# the answer
answer <- dplyr::left_join(kw, doc_lengths) %>%
  dplyr::mutate(
    from_relative = from / toklength,
    to_relative = to / toklength
  )
## Joining, by = "docname"
head(answer)
##       docname from   to pre  keyword post  pattern toklength from_relative
## 1 1985-Reagan 2005 2005     security      security      2909     0.6892403
## 2 1985-Reagan 2152 2152     security      security      2909     0.7397731
## 3 1985-Reagan 2189 2189      nuclear       nuclear      2909     0.7524923
## 4 1985-Reagan 2210 2210      nuclear       nuclear      2909     0.7597112
## 5 1985-Reagan 2245 2245      nuclear       nuclear      2909     0.7717429
## 6 1985-Reagan 2310 2310     security      security      2909     0.7940873
##   to_relative
## 1   0.6892403
## 2   0.7397731
## 3   0.7524923
## 4   0.7597112
## 5   0.7717429
## 6   0.7940873
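If you would rather not depend on dplyr, the same join-and-divide step can be sketched in base R with merge(). This is only an illustration: kw_toy and len_toy are hypothetical stand-ins for kw and doc_lengths, filled with two of the rows and the document length shown in the output above.

```r
# Hypothetical stand-ins for kw and doc_lengths, using values from the output above
kw_toy <- data.frame(
  docname = c("1985-Reagan", "1985-Reagan"),
  from = c(2005, 2152),
  to = c(2005, 2152)
)
len_toy <- data.frame(docname = "1985-Reagan", toklength = 2909)

# left join on docname, then divide absolute positions by document length
res <- merge(kw_toy, len_toy, by = "docname", all.x = TRUE)
res$from_relative <- res$from / res$toklength
res$to_relative <- res$to / res$toklength
res
```

Because window = 0 matches single tokens here, from and to are equal, so the two relative columns are identical; they would differ only for multi-token (phrase) patterns.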