Conditional rolling string concat in data.table

I have a data.table obtained from a somewhat quirky file:

library(data.table)

istub  <- setDT(read.fwf( 'http://www.bls.gov/cex/pumd/2016/csxistub.txt', 
                          widths=c(2,3,64,12,2,3,10), skip=1,
                          stringsAsFactors=FALSE, strip.white=TRUE,
                          col.names = c( "type", "level", "title", "UCC", 
                                         "survey", "factor","group" )
                ) )

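One quirk of the file is that if type==2, the row contains only a continuation of the previous row's title field.

So I want to append the continuation title to the previous row's title. I'm assuming there is only one continuation line per regular row.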
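For anyone who wants to experiment without downloading the file, here is a small made-up table with the same shape. The name toy and all of its values are assumptions for illustration (only one title is borrowed from the comments further down); the real file has more columns. It also checks the one-continuation-per-row assumption stated above.

```r
library(data.table)

# Hypothetical stand-in for istub: rows with type == 2 only continue the title above
toy <- data.table(
  type  = c(1L, 1L, 2L, 1L, 1L, 2L),
  level = c(1L, 2L, NA, 2L, 3L, NA),
  title = c("Food",
            "School books, supplies, equipment for vocational and",
            "technical schools",
            "Housing",
            "Miscellaneous household equipment and",
            "furnishings")
)

# Check the assumption: the first row is not a continuation, and a
# continuation row never directly follows another continuation row
cont <- which(toy$type == 2)
stopifnot(all(cont > 1), all(toy$type[cont - 1] != 2))
```

If the second check fails on the real data, approaches that only look one row ahead would silently drop text.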

For each example, please start with:

df <- copy(istub) # avoids extra requests of file

Base R solution (the desired result):

I know I can do:

# if type == 2, "title" field should be appended to the above row's "title" field
continued <- which(df$type==2)

# You can see that these titles are incomplete,
#  e.g., "School books, supplies, equipment for vocational and"  
tail(df$title[continued-1])

df$title[continued-1] <- paste(df$title[continued-1],df$title[continued])

# Now they're complete
# e.g., "School books, supplies, equipment for vocational and technical schools"    
tail(df$title[continued-1])

# And we could get rid of the continuation lines
df <- df[-continued]

That said, I'd like to practice some data.table-fu.

Attempt using data.table

First, I tried subsetting in i with shift(), but that didn't work:

df[shift(type, type='lead')==2, 
     title := paste(title, shift(title, type='lead') ) ] # doesn't work
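A likely reason, sketched on made-up data (toy and its values are assumptions, not the real file): data.table subsets the rows with i first and only then evaluates j on that subset, so the shift() inside j leads over the selected rows rather than over the whole column.

```r
library(data.table)

# Made-up data: upper-case titles are the continuation rows
toy <- data.table(type  = c(1L, 2L, 1L, 1L, 2L),
                  title = c("a", "A", "b", "c", "C"))

# Row numbers selected by i: rows that are followed by a continuation row
sel <- toy[shift(type, type = "lead") == 2, which = TRUE]

# What shift() sees inside j: only the titles of the selected rows,
# so the "next" title is the next selected row's, not the continuation line
seen <- toy[shift(type, type = "lead") == 2, shift(title, type = "lead")]

print(sel)
print(seen)
```

On this toy table the first selected row would be pasted with "c" instead of its own continuation "A".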

This works:

# %in% yields FALSE, not NA, on the last row, where shift() has nothing to lead
df[,title := ifelse( shift(type, type='lead') %in% 2,
                     paste(title, shift(title, type='lead')),
                     title ) ]

Am I stuck with two shifts (which seems inefficient), or is there a better way?

I was able to do it with a shift()-ed ifelse():

# fill='' keeps the last row from getting the string "NA" pasted onto its title
df[, title := paste0(title, shift( ifelse(type==2, paste0(' ',title), ''),
                                   type='lead', fill='')
                     ) ]
df <- df[type==1] # can get rid of continuation lines

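This seems a bit hacky, paste0-ing a mostly empty string vector. Improvements welcome.

ifelse can almost always be avoided, and is worth avoiding.**

I would probably do...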

# back up the data before editing values
df0 = copy(df)

# find rows
w = df[type == 2, which = TRUE]

# edit at rows up one
stopifnot(all(w > 1))
df[w-1, title := paste(title, df$title[w])]

# drop rows
res = df[-w]
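If the one-continuation-per-row assumption ever turned out not to hold, a grouped variant might be worth considering. This is a different technique from the code above, sketched on made-up data; toy, grp and all values are hypothetical. It assumes the first row is never a continuation row.

```r
library(data.table)

# Made-up data with two continuation rows in a row
toy <- data.table(
  type  = c(1L, 1L, 2L, 2L, 1L),
  title = c("Food",
            "School books, supplies,",
            "equipment for vocational and",
            "technical schools",
            "Housing"),
  UCC   = c("100", "200", NA, NA, "300")
)

# Each regular row starts a new group; continuation rows join the group above
toy[, grp := cumsum(type != 2)]

# Collapse each group's titles into one string, written to every row of the group
toy[, title := paste(title, collapse = " "), by = grp]

# Drop the continuation rows and the helper column
res <- toy[type != 2][, grp := NULL]
print(res)
```

The trade-off is a grouped paste over every row instead of an update on just the affected rows, so for the single-continuation case the indexed update above is probably the leaner choice.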

** Some examples...

Q&A

  • Does ifelse really calculate both of its vectors every time? Is it slow?
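A small sketch of what that question is getting at. The counter environment and the helper names yes_branch and no_branch are made up for illustration. When the test vector contains both TRUE and FALSE, ifelse evaluates each branch over the full input, even if only one element ends up using it.

```r
# Count how many elements each branch is asked to compute
calls <- new.env()
calls$yes <- 0L
calls$no  <- 0L

yes_branch <- function(x) { calls$yes <- calls$yes + length(x); paste(x, "yes") }
no_branch  <- function(x) { calls$no  <- calls$no  + length(x); paste(x, "no") }

x    <- c("a", "b", "c", "d")
test <- c(TRUE, FALSE, FALSE, FALSE)

# Both branches are computed for all four elements
out <- ifelse(test, yes_branch(x), no_branch(x))

# Indexed assignment computes the "yes" value only where it is needed
out2 <- paste(x, "no")
out2[test] <- paste(x[test], "yes")

print(c(yes = calls$yes, no = calls$no))
print(identical(out, out2))
```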

Workarounds