Inserting rows into a dataframe based on a vector that contains dates

This is what my data frame looks like:

df <- read.table(text='

    Name      ActivityType     ActivityDate              
     John       Email            2014-01-01                              
     John       Webinar          2014-01-05                            
     John       Webinar          2014-01-20                                                       
     John       Email            2014-04-20                            
     Tom        Email            2014-01-01                              
     Tom       Webinar           2014-01-05                           
     Tom       Webinar           2014-01-20                                                        
     Tom       Email             2014-04-20                              

    ', header = TRUE, row.names = NULL)

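I have a vector x that contains different dates: x <- c("2014-01-03","2014-01-25","2015-05-27"). I want to insert rows into my original data frame so that every date in x is added for each name. The output should look like this: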

    Name   ActivityType   ActivityDate
    John   Email          2014-01-01
    John   NA             2014-01-03
    John   Webinar        2014-01-05
    John   Webinar        2014-01-20
    John   NA             2014-01-25
    John   Email          2014-04-20
    John   NA             2015-05-27
    Tom    Email          2014-01-01
    Tom    NA             2014-01-03
    Tom    Webinar        2014-01-05
    Tom    Webinar        2014-01-20
    Tom    NA             2014-01-25
    Tom    Email          2014-04-20
    Tom    NA             2015-05-27

Thank you very much for your help!

1) expand.grid Use expand.grid to create a data frame adds containing the rows to be added, then use rbind to combine df and adds, converting the ActivityDate column to "Date" class. Then sort. No packages are used.

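# One NA row per (Name, date) pair; unique() works whether Name is a factor or character
adds <- expand.grid(Name = unique(df$Name), ActivityType = NA, ActivityDate = x,
                    stringsAsFactors = FALSE)
# Stack the original and new rows, then convert ActivityDate to "Date" class
both <- transform(rbind(df, adds), ActivityDate = as.Date(ActivityDate))

# Sort by name, then by date
o <- with(both, order(Name, ActivityDate))
both[o, ]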

giving:

   Name ActivityType ActivityDate
1  John        Email   2014-01-01
9  John         <NA>   2014-01-03
2  John      Webinar   2014-01-05
3  John      Webinar   2014-01-20
11 John         <NA>   2014-01-25
4  John        Email   2014-04-20
13 John         <NA>   2015-05-27
5   Tom        Email   2014-01-01
10  Tom         <NA>   2014-01-03
6   Tom      Webinar   2014-01-05
7   Tom      Webinar   2014-01-20
12  Tom         <NA>   2014-01-25
8   Tom        Email   2014-04-20
14  Tom         <NA>   2015-05-27
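If you need this more than once, the steps above can be wrapped in a small function. This is only a sketch: add_dates is a name I made up, and the toy data frame is a cut-down version of the one in the question. It assumes the columns are called Name, ActivityType and ActivityDate.

```r
# Hypothetical helper (not from any package): adds one NA-activity row
# per (Name, date) pair and returns the result sorted by Name and date.
add_dates <- function(df, dates) {
  adds <- expand.grid(Name = unique(as.character(df$Name)),
                      ActivityType = NA,
                      ActivityDate = dates,
                      stringsAsFactors = FALSE)
  both <- rbind(df, adds)
  both$ActivityDate <- as.Date(both$ActivityDate)
  both <- both[order(both$Name, both$ActivityDate), ]
  rownames(both) <- NULL   # renumber rows after sorting
  both
}

# Cut-down version of the question's data, for illustration only
df_small <- data.frame(
  Name = rep(c("John", "Tom"), each = 2),
  ActivityType = c("Email", "Webinar", "Email", "Webinar"),
  ActivityDate = c("2014-01-01", "2014-01-05", "2014-01-01", "2014-01-05"),
  stringsAsFactors = FALSE
)

res <- add_dates(df_small, c("2014-01-03", "2014-02-01"))
res
```

Note that it does not check whether a date in x already exists for a name, so such a date would show up twice (once with its activity, once with NA).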

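2) sqldf This uploads adds and df to an SQLite database that it creates on the fly, then runs the SQL query and downloads the result. The computation happens outside of R, so it may be suitable for your large data.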

adds <- data.frame(Name = NA, ActivityDate = x)

library(sqldf)

sqldf("select * 
       from (select * 
             from df 
             union 
             select a.Name, NULL ActivityType, ActivityDate 
             from (select distinct Name from df) a 
             cross join adds b
            ) order by 1, 3"
      )

giving:

   Name ActivityType ActivityDate
1  John        Email   2014-01-01
2  John         <NA>   2014-01-03
3  John      Webinar   2014-01-05
4  John      Webinar   2014-01-20
5  John         <NA>   2014-01-25
6  John        Email   2014-04-20
7  John         <NA>   2015-05-27
8   Tom        Email   2014-01-01
9   Tom         <NA>   2014-01-03
10  Tom      Webinar   2014-01-05
11  Tom      Webinar   2014-01-20
12  Tom         <NA>   2014-01-25
13  Tom        Email   2014-04-20
14  Tom         <NA>   2015-05-27
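For readers who do not know SQL, here is roughly what that query does, step by step, in base R. It is a sketch on a cut-down data frame (df_small and x_small are made-up names) and relies on merge() returning the Cartesian product when by = NULL.

```r
# Cut-down data for illustration only
df_small <- data.frame(
  Name = c("John", "John", "Tom"),
  ActivityType = c("Email", "Webinar", "Email"),
  ActivityDate = c("2014-01-01", "2014-01-05", "2014-01-01"),
  stringsAsFactors = FALSE
)
x_small <- c("2014-01-03", "2014-01-25")

# "select distinct Name from df"
names_df <- unique(df_small["Name"])

# "cross join adds": with by = NULL, merge() pairs every row of the
# first data frame with every row of the second
adds <- merge(names_df,
              data.frame(ActivityDate = x_small, stringsAsFactors = FALSE),
              by = NULL)
adds$ActivityType <- NA            # "NULL ActivityType"

# "union": stack both sets and drop exact duplicate rows
both <- unique(rbind(df_small, adds[names(df_small)]))

# "order by 1, 3": sort by Name, then ActivityDate
both <- both[order(both$Name, both$ActivityDate), ]
both
```

The difference is that here everything stays in R's memory, whereas sqldf does the work inside SQLite.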

It looks like you are adding the 'new' dates for each person, is that right?

In that case you can turn x into a data.frame and then merge/join it with df:

## original dataframe
df <- data.frame(Name = c(rep("John", 4), rep("Tom", 4)),
                 ActivityType = c("Email","Web","Web","Email","Email","Web","Web", "Email"),
                 ActivityDate = c("2014-01-01","2014-01-05","2014-01-20","2014-04-20","2014-01-01","2014-01-05","2014-01-20","2014-04-20"))

## Turning x into a dataframe: every new date for every name.
x <- data.frame(ActivityDate = rep(c("2014-01-03","2014-01-25","2015-05-27"), 2),
                Name = rep(c("John","Tom"), each = 3))

## all = TRUE keeps unmatched rows from both sides (full outer join)
merge(df, x, by=c("Name", "ActivityDate"), all=TRUE)

## Output with character columns (the default in R >= 4.0);
## merge sorts the rows by Name, then ActivityDate
#    Name ActivityDate ActivityType
# 1  John   2014-01-01        Email
# 2  John   2014-01-03         <NA>
# 3  John   2014-01-05          Web
# 4  John   2014-01-20          Web
# 5  John   2014-01-25         <NA>
# 6  John   2014-04-20        Email
# 7  John   2015-05-27         <NA>
# 8   Tom   2014-01-01        Email
# 9   Tom   2014-01-03         <NA>
# 10  Tom   2014-01-05          Web
# 11  Tom   2014-01-20          Web
# 12  Tom   2014-01-25         <NA>
# 13  Tom   2014-04-20        Email
# 14  Tom   2015-05-27         <NA>

Update

Since you are running into memory problems, you can use data.table like this:

library(data.table)
dt <- as.data.table(df)
x_dt <- as.data.table(x)

merge(dt, x_dt, by=c("Name","ActivityDate"), all=T)

Alternatively, if you don't want to merge, you can rbind them using data.table's rbindlist:

rbindlist(list(dt, x_dt), fill=TRUE)  ## fill sets the 'ActivityType' to NA in X
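Keep in mind that rbindlist only stacks the two tables, so the new rows end up at the bottom. A minimal sketch of how you might build the name/date combinations with CJ and then sort in place with setorder is below. It uses a cut-down table (dt_small and x_small are made-up names) and assumes the data.table package is installed.

```r
library(data.table)

# Cut-down data for illustration only
dt_small <- data.table(
  Name = c("John", "John", "Tom"),
  ActivityType = c("Email", "Webinar", "Email"),
  ActivityDate = c("2014-01-01", "2014-01-05", "2014-01-01")
)
x_small <- c("2014-01-03", "2014-01-25")

# CJ() builds every Name/date combination (a cross join)
adds <- CJ(Name = unique(dt_small$Name), ActivityDate = x_small)

# Stack the tables; fill = TRUE sets the missing ActivityType to NA
res <- rbindlist(list(dt_small, adds), use.names = TRUE, fill = TRUE)

# Convert to Date and sort by reference (no copy of the table is made)
res[, ActivityDate := as.Date(ActivityDate)]
setorder(res, Name, ActivityDate)
res
```

Both := and setorder modify res in place, which helps when memory is tight.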

Update 2

Generating your x with 16000 unique names (I'm using numbers here, but the principle is the same) and 31 dates (every day of January 2014):

ActivityDates <- seq(as.Date("2014-01-01"), as.Date("2014-01-31"), by=1)
Names <- seq(1,16000)

## each = ... repeats every name once per date, so all name/date pairs appear
x <- data.frame(Names = rep(Names, each = length(ActivityDates)),
                ActivityDates = rep(ActivityDates, length(Names)))
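One thing to watch when building such a grid with rep(): if both vectors are simply recycled whole, you only get every name/date pair when the two lengths share no common factor. That happens to hold for 16000 names and 31 dates, but it would not for 30 dates. A tiny sketch with made-up sizes (4 names, 2 dates) shows the difference, with expand.grid as a safer alternative:

```r
names_small <- 1:4
dates_small <- as.Date(c("2014-01-01", "2014-01-02"))

# Recycling both vectors whole pairs them in lockstep:
# (1, d1), (2, d2), (3, d1), (4, d2), then the same four pairs again
lockstep <- data.frame(Names = rep(names_small, length(dates_small)),
                       ActivityDates = rep(dates_small, length(names_small)))

# expand.grid() enumerates all 4 * 2 combinations
grid <- expand.grid(Names = names_small, ActivityDates = dates_small)

nrow(unique(lockstep))  # 4 distinct pairs, half of them missing
nrow(grid)              # 8
```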