Add new column to a data.table; created using assign in loop
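I have a data.frame, keywordsCategory, that contains a set of phrases which I want to categorize according to the words they contain.
For example, one of my "check terms" is test1, which corresponds to category cat1. Since the first observation of my data.frame is This is a test1, I need to put the corresponding category in a new column, category.
Because one observation can be assigned to more than one category, I thought the best option was to create independent subsets of my data.frame using grepl and afterwards bind everything into a new data.frame.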
library(data.table)

wordsToCheck <- c("test1", "test2", "This")
categoryToAssign <- c("cat1", "cat2", "cat3")
keywordsCategory <- data.frame(Keyword=c("This is a test1", "This is a test2"))

for (i in 1:length(wordsToCheck)) {
  myOriginal <- wordsToCheck[i]
  myCategory <- categoryToAssign[i]
  dfToCreate <- paste0("withCategory",i)
  assign(dfToCreate,
         data.table(keywordsCategory[grepl(paste0(".*",myOriginal,".*"),
                                           keywordsCategory$Keyword)==TRUE,]))
  # this won't work :(
  # dfToCreate[,category:=myCategory]
}

# Create a list with all newly created data.tables
l.df <- lapply(ls(pattern="withCategory[0-9]+"), function(x) get(x))

# Create an aggregated dataframe with all Keywords data.tables
newdf <- do.call("rbind", l.df)
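The subset-then-rbind part works, but I can't assign the corresponding category to my newly created data.tables. If I uncomment that line, I get the following error: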
Error in `:=`(category, myCategory) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
However, if I add the column manually after the loop has finished, e.g.:
withCategory1[,category:=myCategory]
it works fine and the table output is as expected:
> withCategory1
                V1 category
1: This is a test1     cat2
tableOutput <- structure(list(V1 = structure(1L, .Label = c("This is a test1",
"This is a test2"), class = "factor"), category = "cat2"), .Names = c("V1",
"category"), row.names = c(NA, -1L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x00000000001f0788>)
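What is the best/safest way to add a new column to a data.table that was created with the assign function inside a loop? The solution doesn't need to use data.table; I'm only using it because my real data has millions of observations and I thought data.table would be faster.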
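On the error itself: inside the loop, dfToCreate holds the character string "withCategory1", not the data.table stored under that name, so `:=` ends up being called outside of a data.table. A minimal sketch of one way around this is to build each subset as an ordinary local data.table, add the column there, and collect the pieces in a list rather than in numbered variables. The names subsets, hit, dt and cat_i are illustrative, and fixed = TRUE (plain substring matching) is an assumption; the original wraps the term in ".*" and matches it as a regular expression.

```r
library(data.table)

wordsToCheck     <- c("test1", "test2", "This")
categoryToAssign <- c("cat1", "cat2", "cat3")
keywordsCategory <- data.frame(Keyword = c("This is a test1", "This is a test2"))

# Work on a character vector so the result does not depend on factor handling.
kw <- as.character(keywordsCategory$Keyword)

# Collect the subsets in a list instead of creating withCategory1, withCategory2, ...
subsets <- vector("list", length(wordsToCheck))
for (i in seq_along(wordsToCheck)) {
  # Rows whose keyword contains the i-th check term (assumption: literal match).
  hit   <- grepl(wordsToCheck[i], kw, fixed = TRUE)
  cat_i <- categoryToAssign[i]
  dt    <- data.table(Keyword = kw[hit])
  # dt is a real data.table here, not a name stored in a string, so := works.
  dt[, category := cat_i]
  subsets[[i]] <- dt
}

# Bind all subsets into one long table.
newdf <- rbindlist(subsets)
```

Keeping the pieces in a list also removes the need for the ls(pattern = ...) and get() step afterwards.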
As an alternative to the for loop, you can use a combination of paste0, mapply and grepl to get what you want:
# create a 'data.table'
newDT <- as.data.table(keywordsCategory)
# assign the correct categories to each row
newDT[, category := paste0(categoryToAssign[mapply(grepl, wordsToCheck, Keyword)], collapse = ','), 1:nrow(newDT)]
which gives:
> newDT
           Keyword  category
1: This is a test1 cat1,cat3
2: This is a test2 cat2,cat3
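To see what happens inside each group, here is a small sketch that runs the same expression for a single keyword outside of data.table. The names hits and joined are illustrative only. Note that the check terms are treated as regular expressions here, as in the call above.

```r
wordsToCheck     <- c("test1", "test2", "This")
categoryToAssign <- c("cat1", "cat2", "cat3")

# One grepl() call per check term against a single keyword;
# the result is a logical vector with one element per term.
hits <- mapply(grepl, wordsToCheck, "This is a test1")

# Keep the categories whose term matched and join them with commas.
joined <- paste0(categoryToAssign[hits], collapse = ",")
joined
# "cat1,cat3"
```

Grouping by 1:nrow(newDT) simply repeats this step once per row.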
If you want to expand the category column so that each row holds one category, see this Q&A for several ways to do that. For example:
library(splitstackshape)
cSplit(newDT, 'category', ",", direction = 'long')
you get:
           Keyword category
1: This is a test1     cat1
2: This is a test1     cat3
3: This is a test2     cat2
4: This is a test2     cat3
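If you would rather not load an extra package for this step, the same long format can be sketched with data.table alone by splitting the string within each group. This is only one possible approach; longDT is an illustrative name, newDT is recreated here so the snippet runs on its own, and grouping by Keyword assumes that is the grouping you want.

```r
library(data.table)

# Same content as the newDT shown above, recreated so the sketch is self-contained.
newDT <- data.table(Keyword  = c("This is a test1", "This is a test2"),
                    category = c("cat1,cat3", "cat2,cat3"))

# Split the comma-separated string and return one row per category.
longDT <- newDT[, .(category = unlist(strsplit(category, ",", fixed = TRUE))),
                by = Keyword]
longDT
```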