Transform Two Column Data Frame into Quanteda Dictionary Format

My ultimate goal is to create a quanteda dictionary that I can use to classify text data by topic.

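However, my topic keywords are stored in a slightly different format: one column contains roughly 4000 keywords, and a second column specifies the topic each keyword belongs to. Note that the number of words per topic is not equal. My data looks like this: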

     keywords      topic
[1]  "one"         "number"
[2]  "two"         "number"
[3]  "three"       "number"
[4]  "triangle"    "form"
[5]  "circle"      "form"
[...]

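How can I convert my keywords into the (quanteda) dictionary format, that is, a named list with one character vector per topic, each holding the keywords for that topic?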

The list should look like this:

list(number = c("one","two","three"),
     form = c("triangle","circle"))

Any help is much appreciated!

Here is my approach so far, but it does not seem right to me (and it does not work):

# 1) Initialize an empty list of vectors that corresponds to my number of topics & add topic names ("topic_names" is just a vector type chr 1:88 that includes the topic names)

mydictionary <- vector(mode = "list", length = 88)
names(mydictionary) <- topic_names

# 2) Create a loop that checks for each keyword to match a topic and adds it to the respective vector of my dictionary

# I got it working for one keyword like this:
if (names(mydictionary[1]) == keyword_list$topic[1]) { # if topic of keyword matches topic vector name
  mydictionary[[1]] <- c(mydictionary[[1]], keyword_list$keywords[1]) # add keyword to topic vector
}

# However, I don't know how to transform this into a loop, since a loop has to check every index of keyword_list for every index of mydictionary and I don't know how to achieve this...

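If your data is in a data.frame called topics (see the Data section), you can get it into the desired list quickly with the base R function split, which groups the keywords by the values in the topic column.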

my_dictionary <- split(topics$keywords, topics$topic)
my_dictionary

$form
[1] "triangle" "circle"  

$number
[1] "one"   "two"   "three"
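Since the stated goal is a quanteda dictionary, the named list can then be handed to quanteda's dictionary() constructor. The following is only a minimal sketch, assuming the quanteda package is installed; the name quanteda_dict is made up for illustration, and the conversion step is skipped if the package is missing.

```r
# Sample data (same values as in the Data section)
topics <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of keywords per topic
my_dictionary <- split(topics$keywords, topics$topic)

# Convert the named list to a quanteda dictionary object.
# Assumption: quanteda is installed; otherwise this step is skipped.
if (requireNamespace("quanteda", quietly = TRUE)) {
  quanteda_dict <- quanteda::dictionary(my_dictionary)
  print(quanteda_dict)
}
```

Note that split orders the list elements by the sorted topic names, so the order may differ from the order in which topics first appear in the data.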

Data:

topics <- structure(list(keywords = c("one", "two", "three", "triangle", 
"circle"), topic = c("number", "number", "number", "form", "form"
)), class = "data.frame", row.names = c(NA, -5L))
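For comparison, the loop asked about in the question does not need to check every keyword against every topic: indexing the list by topic name is enough. This is a rough sketch only, reusing the question's names keyword_list, topic_names, and mydictionary with the sample data above; deriving topic_names with unique() is an assumption about how that vector was built.

```r
# Sample data under the question's variable name
keyword_list <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# Assumption: topic_names holds each topic exactly once
topic_names <- unique(keyword_list$topic)

# Empty list with one slot per topic, named by topic
mydictionary <- vector(mode = "list", length = length(topic_names))
names(mydictionary) <- topic_names

# Single loop over the keywords: append each one to the slot
# whose name equals the keyword's topic
for (i in seq_len(nrow(keyword_list))) {
  topic <- keyword_list$topic[i]
  mydictionary[[topic]] <- c(mydictionary[[topic]], keyword_list$keywords[i])
}

print(mydictionary)
```

This gives the same grouping as split, but growing vectors inside a loop is slower and more verbose, so split is likely the better choice for about 4000 keywords.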