How can this slow R code be rewritten to be more efficient?
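Hi everyone, I am trying to optimize the following R code, which relies on for loops, because it takes a long time to execute. I even tried using the compiler package in R to convert the function to byte code, but the performance got worse. So, is there a way to write this code with the apply functions?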
library(magrittr)  # provides %>%; loading dplyr also works

word_separation <- function(inp_data) {
  df <- NULL
  for (k in 1:nrow(inp_data)) {
    vec <- unlist(strsplit(as.vector(inp_data[k, ]), split = ","))
    if (length(vec) == 1) {
      df <- rbind(df, data.frame(first_col = vec, second_col = vec))
    } else {
      temp_df <- NULL
      for (i in 2:length(vec)) {
        for (j in i:length(vec)) {
          temp_df <- rbind(temp_df, data.frame(first_col = vec[1], second_col = paste(vec[i:j], collapse = ",")))
        }
        df <- rbind(df, temp_df)
        df[df == ""] <- NA
        df <- df %>% unique() %>% na.omit()
      }
    }
  }
  return(df)
}
Here my inp_data data frame has a single column of data:
column
Milk,Bread,Eggs,Jam
Apple,Milk,Beer
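When passed to the function, it returns a data frame with two columns: the first column contains the first word of each row, and the second column contains combinations of the other words in that row.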
first_col second_col
Milk Bread
Milk Bread,Eggs
Milk Bread,Eggs,Jam
Milk Eggs
Milk Eggs,Jam
Milk Jam
Apple Milk
Apple Milk,Beer
Apple Beer
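The OP has specified that the input data consists of a single column, so we need to split the column before creating the combinations. (The answer given by Sathish silently skips this step.)
The data.table solution below uses only a single lapply().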
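As a small standalone illustration of the core step, the sketch below shows what combining lapply() with combn() yields for the words after the first one in a single row. The variable names rest and combos are made up for this example and are not part of the solution itself.

```r
# Words that follow the first word in "Milk,Bread,Eggs,Jam"
rest <- c("Bread", "Eggs", "Jam")

# For each subset size m, combn() enumerates every m-element combination
# and paste() collapses each one into a comma-separated string
combos <- unlist(lapply(seq_along(rest), function(m) {
  combn(rest, m, FUN = paste, collapse = ",")
}))

print(combos)
```

With three words this gives 2^3 - 1 = 7 strings, ordered by subset size, which is the order seen in the result further down.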
Data
Edit: added a row with only one word
library(data.table)
inp_data <- fread(" column
Milk,Bread,Eggs,Jam
Apple,Milk,Beer
Butter", sep = "\n")
Code
# split strings, output in long format, add row number for later join
molten <- inp_data[, rn := .I][, strsplit(column, ","), by = rn]
# create combinations of all words (except the first one)
combined <- molten[, unlist(
  lapply(seq_len(.N - 1), function(.i) as.data.table(
    combn(V1[-1], .i, paste, collapse = ",", simplify = TRUE)))), by = rn]
# right join: keep every rn, so one-word rows appear with second_col = NA
combined[molten[, .(first_col = first(V1)), by = rn],
         .(rn, first_col, second_col = V1), on = "rn"]
#    rn first_col     second_col
# 1:  1      Milk          Bread
# 2:  1      Milk           Eggs
# 3:  1      Milk            Jam
# 4:  1      Milk     Bread,Eggs
# 5:  1      Milk      Bread,Jam
# 6:  1      Milk       Eggs,Jam
# 7:  1      Milk Bread,Eggs,Jam
# 8:  2     Apple           Milk
# 9:  2     Apple           Beer
#10:  2     Apple      Milk,Beer
#11:  3    Butter             NA
Edit: changed the join to ensure that rows containing only one word are included as well.
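Note that the result above contains every combination of the remaining words (for example "Bread,Jam"), whereas the sample output in the question lists only runs of adjacent words. If the adjacent-run behaviour of the original loops is what is wanted, a base R sketch along the following lines might do. The function name word_runs is hypothetical, and the sketch assumes a single character (or factor) column. It mirrors the question's function by repeating a lone word in both columns, and it does not replicate the loop's removal of empty strings.

```r
# Hypothetical helper: reproduces the question's adjacent-run output
word_runs <- function(inp_data) {
  words <- strsplit(as.character(inp_data[[1]]), ",", fixed = TRUE)
  pieces <- lapply(words, function(vec) {
    if (length(vec) == 0) return(NULL)  # skip empty rows
    if (length(vec) == 1) {
      # Same as the question's function: a lone word fills both columns
      return(data.frame(first_col = vec, second_col = vec,
                        stringsAsFactors = FALSE))
    }
    rest <- vec[-1]
    n <- length(rest)
    # Every (start, end) index pair with start <= end, ordered by start
    idx <- expand.grid(end = seq_len(n), start = seq_len(n))
    idx <- idx[idx$start <= idx$end, ]
    runs <- mapply(function(s, e) paste(rest[s:e], collapse = ","),
                   idx$start, idx$end)
    data.frame(first_col = vec[1], second_col = runs,
               stringsAsFactors = FALSE)
  })
  # Bind once at the end instead of growing a data frame inside the loops
  unique(do.call(rbind, pieces))
}

inp <- data.frame(column = c("Milk,Bread,Eggs,Jam", "Apple,Milk,Beer"),
                  stringsAsFactors = FALSE)
res <- word_runs(inp)
print(res)
```

Collecting the per-row pieces in a list and calling rbind() a single time avoids the repeated copying caused by growing df inside nested loops, which is likely a large part of the original slowdown.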