Detokenize a Quanteda tokens object
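I have a quanteda tokens object created with the "window" option (see the code below). I want to do this for a range of words to inform the creation of a custom dictionary. How can I "detokenize" each tokenized "window" text, that is, concatenate or reassemble it into a single string? Each string could be an item in a list or a row in a data.frame. I just need to be able to read each instance of the word/phrase (in this case "future") in its context.
Is there a command or some code that would let me "detokenize" this?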
library(quanteda)
library(dplyr)
# Example data
d <- c("Thank you Mr. Speaker. Mr. Speaker I’m not sure how, but to the department of PWTTS, regarding the question I’d asked previously about the future of our water reservoir. I wonder if that was looked at since I ask that question to Ms. Thompson. Thank you", "Thank you Mr. Speaker. Now if that doctor would be located in that community how is the logistics or air travel going to be, moving between the communities in the future. Thank you")
# Corpus
c <- corpus(d)
# My tokens object consisting of 3-word window around instances of "future".
ttt <- tokens(c, remove_punct = TRUE, remove_numbers = FALSE) %>%
  tokens_keep(pattern = "future", window = 3)
For list output:
> lapply(ttt, paste, collapse = " ")
$text1
[1] "previously about the future of our water"
$text2
[1] "communities in the future Thank you"
Or, for a character vector, which can easily become a column in your data.frame:
> vapply(ttt, paste, collapse = " ", character(1))
text1 text2
"previously about the future of our water" "communities in the future Thank you"
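To put the strings into a data.frame with one row per document, something like the following sketch may help. It uses base R only and a hand-written named list (`windows`, a hypothetical stand-in copied from the output above) in place of the real tokens object; with quanteda loaded, `as.list(ttt)` should give a list of the same shape. The column names `doc_id` and `context` are illustrative choices, not anything quanteda requires.

```r
# Stand-in for as.list(ttt): a named list of character vectors, one per document.
# (Hypothetical data copied from the output above; not produced by quanteda here.)
windows <- list(
  text1 = c("previously", "about", "the", "future", "of", "our", "water"),
  text2 = c("communities", "in", "the", "future", "Thank", "you")
)

# Collapse each window into a single string, as in the vapply() call above.
context <- vapply(windows, paste, character(1), collapse = " ")

# One row per document: the document name and its reassembled context string.
result <- data.frame(
  doc_id = names(context),
  context = unname(context),
  stringsAsFactors = FALSE
)
print(result)
```

Note that if a document contains several matches, the kept tokens for all of its windows end up in the same element, so they are pasted into one string for that document.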