Transform Two Column Data Frame into Quanteda Dictionary Format
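My ultimate goal is to create a quanteda dictionary for topic classification of text data.
However, my topic keywords are stored in a slightly different format: I have one column containing about 4000 keywords and a second column specifying the topic each keyword belongs to. Note that the number of words per topic is not equal. My data looks like this: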
keywords topic
[1] "one" "number"
[2] "two" "number"
[3] "three" "number"
[4] "triangle" "form"
[5] "circle" "form"
[...]
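How can I transform my keywords into (quanteda) dictionary format, i.e. a named list with one character vector per topic, each holding that topic's keywords?
The list should look like this: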
list(number = c("one","two","three"),
form = c("triangle","circle"))
Any help is greatly appreciated!
Here is my approach so far, but it does not seem right to me (and it does not work):
# 1) Initialize an empty list of vectors that corresponds to my number of topics & add topic names ("topic_names" is just a vector type chr 1:88 that includes the topic names)
mydictionary <- vector(mode = "list", length = 88)
names(mydictionary ) <- topic_names
# 2) Create a loop that checks for each keyword to match a topic and adds it to the respective vector of my dictionary
# I got it working for one keyword like this:
if (names(mydictionary [1]) == keyword_list$topic[1]) { # if topic of keyword matches topic vector name
mydictionary[[1]] <- c(mydictionary[[1]], keyword_list$keywords[1]) #add keyword to topic vector
}
# However, I don't know how to transform this into a loop, since a loop has to check every index of keyword_list for every index of mydictionary and I don't know how to achieve this...
If your data is in a data.frame called topics (see the Data section), you can get it into the desired list quickly with the split function:
my_dictionary <- split(topics$keywords, topics$topic)
my_dictionary
$form
[1] "triangle" "circle"
$number
[1] "one" "two" "three"
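Note that split() converts the grouping column to a factor, so the topics come out in sorted order ("form" before "number") rather than in order of appearance. If the original order matters, one option is to set the factor levels explicitly. The sketch below also passes the list to quanteda::dictionary(); it assumes the quanteda package is installed, and the names my_dictionary_ordered and topic_dict are only illustrative.

```r
# Same example data as in the Data section, repeated so the snippet runs on its own
topics <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic    = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# Keep topics in order of first appearance instead of sorted order
topic_factor <- factor(topics$topic, levels = unique(topics$topic))
my_dictionary_ordered <- split(topics$keywords, topic_factor)

# Turn the named list into a quanteda dictionary (only if quanteda is available)
if (requireNamespace("quanteda", quietly = TRUE)) {
  topic_dict <- quanteda::dictionary(my_dictionary_ordered)
  print(topic_dict)
}
```

The resulting object can then be used wherever quanteda expects a dictionary; the plain split() call above is enough if the order of the topics is irrelevant.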
Data:
topics <- structure(list(keywords = c("one", "two", "three", "triangle",
"circle"), topic = c("number", "number", "number", "form", "form"
)), class = "data.frame", row.names = c(NA, -5L))
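For completeness, the loop from the question can also be made to work, although split() is shorter. A single loop over the rows is enough, because a list element can be looked up by topic name directly, so there is no need to compare every keyword against every list index. This is only a sketch: keyword_list is assumed to have the same two columns as topics, and topic_names is derived here from the data rather than taken from the 88-element vector in the question.

```r
# Example data with the column names used in the question (assumed layout)
keyword_list <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic    = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# One empty slot per topic, named after the topic
topic_names <- unique(keyword_list$topic)
mydictionary <- vector(mode = "list", length = length(topic_names))
names(mydictionary) <- topic_names

# Append each keyword to the vector stored under its topic name
for (i in seq_len(nrow(keyword_list))) {
  topic <- keyword_list$topic[i]
  mydictionary[[topic]] <- c(mydictionary[[topic]], keyword_list$keywords[i])
}
```

Growing vectors inside a loop is fine for about 4000 keywords, but it does more work than the vectorised split() call.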