Conditional rolling string concat in data.table
I have a data.table obtained from a somewhat quirky file:
library(data.table)
istub <- setDT(read.fwf( 'http://www.bls.gov/cex/pumd/2016/csxistub.txt',
widths=c(2,3,64,12,2,3,10), skip=1,
stringsAsFactors=FALSE, strip.white=TRUE,
col.names = c( "type", "level", "title", "UCC",
"survey", "factor","group" )
) )
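One quirk of the file is that if type==2, the row contains only a continuation of the previous row's title field.
So I want to append the continuation title to the title of the row above. I am assuming there is only one continuation line per regular line.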
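To make the layout concrete, here is a small made-up table with the same shape. The name toy and most of the titles are invented for illustration; only the "School books" title echoes the example quoted in the question.

```r
library(data.table)

# Hypothetical miniature of the file's layout: type 1 rows are regular rows,
# type 2 rows carry only the rest of the previous row's title.
toy <- data.table(
  type  = c(1L, 1L, 2L, 1L),
  title = c("Food",
            "School books, supplies, equipment for vocational and",
            "technical schools",
            "Housing")
)

# Row numbers of the continuation lines
continued <- which(toy$type == 2L)

# Titles that are cut off and continue on the next line
incomplete <- toy$title[continued - 1L]
incomplete
```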
For each example, please start with:
df <- copy(istub) # avoids extra requests of file
Base R solution (the desired result):
I know I can do:
# if type == 2, "title" field should be appended to the above row's "title" field
continued <- which(df$type==2)
# You can see that these titles are incomplete,
# e.g., "School books, supplies, equipment for vocational and"
tail(df$title[continued-1])
df$title[continued-1] <- paste(df$title[continued-1],df$title[continued])
# Now they're complete
# e.g., "School books, supplies, equipment for vocational and technical schools"
tail(df$title[continued-1])
# And we could get rid of the continuation lines
df <- df[-continued]
However, I'd like to practice some data.table-fu.
Attempts using data.table
First, I tried using shift() to subset in i, but this didn't work:
df[shift(type, type='lead')==2,
title := paste(title, shift(title, type='lead') ) ] # doesn't work
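A likely reason this attempt fails, sketched on a made-up table: when i selects rows, j is evaluated on that subset only, so shift(title, type='lead') leads within the selected rows rather than within the whole table. The table toy and the helper idx are hypothetical names used only for this illustration.

```r
library(data.table)

toy <- data.table(
  type  = c(1L, 2L, 1L, 1L, 2L),
  title = c("a", "b", "c", "d", "e")
)

# Rows followed by a continuation row: rows 1 and 4.
# which() drops the NA that shift() produces for the last row.
idx <- which(shift(toy$type, type = "lead") == 2L)

# Inside j, `title` is only the subset c("a", "d"), so the lead is taken
# within those two rows, not from rows 2 and 5 of the full table.
lead_in_subset <- toy[idx, shift(title, type = "lead")]

# Looking the continuation up in the full column by row number avoids this.
fixed <- copy(toy)[idx, title := paste(title, toy$title[idx + 1L])]
fixed$title
```

Here lead_in_subset holds "d" and NA rather than the intended "b" and "e", which is why the pasted titles come out wrong.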
This works:
df[,title := ifelse( shift(type, type='lead')==2,
paste(title, shift(title, type='lead')),
title ) ]
Am I stuck with two shift calls (which seems inefficient), or is there a better way?
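One thing that may be worth checking with the ifelse() version, sketched on a made-up table: shift(..., type='lead') has no value for the last row and returns NA there, and ifelse() returns NA wherever its test is NA, so the last title can be overwritten with NA. The fill argument of shift() is one way around that. The value 0L assumes type is an integer column, which has not been verified against the real file.

```r
library(data.table)

toy <- data.table(type = c(1L, 2L, 1L), title = c("a", "b", "c"))

# As in the question: the test is NA for the last row, so its title becomes NA.
with_na <- copy(toy)[, title := ifelse(shift(type, type = "lead") == 2L,
                                       paste(title, shift(title, type = "lead")),
                                       title)]
with_na$title

# With fill = 0L the test is FALSE for the last row and its title is kept.
filled <- copy(toy)[, title := ifelse(shift(type, type = "lead", fill = 0L) == 2L,
                                      paste(title, shift(title, type = "lead")),
                                      title)]
filled$title
```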
I was able to do it with a shift()-ed ifelse():
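# fill='' because the last row has no following row; the default NA would be pasted in as the text "NA"
df[, title := paste0(title, shift( ifelse(type==2, paste0(' ',title), ''),
type='lead', fill='')
) ]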
df <- df[type==1] # can get rid of continuation lines
It feels a bit hacky, paste0-ing a mostly empty character vector. Improvements are welcome.
ifelse is almost always avoidable and worth avoiding.**
I'd probably do...
# back up the data before editing values
df0 = copy(df)
# find rows
w = df[type == 2, which = TRUE]
# edit at rows up one
stopifnot(all(w > 1))
df[w-1, title := paste(title, df$title[w])]
# drop rows
res = df[-w]
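The same steps on a small made-up table, as a quick check of what they produce. The table toy and its titles are hypothetical and not taken from the file.

```r
library(data.table)

toy <- data.table(
  type  = c(1L, 1L, 2L, 1L),
  title = c("Food",
            "School books for vocational and",
            "technical schools",
            "Housing")
)

# find rows: row numbers of the continuation lines
w <- toy[type == 2, which = TRUE]

# a continuation line cannot be the first row
stopifnot(all(w > 1))

# edit at rows up one: append each continuation to the row above it
toy[w - 1, title := paste(title, toy$title[w])]

# drop the continuation rows
res <- toy[-w]
res$title
```

Because the continuation text is looked up in the full column by row number, no shift() or ifelse() is involved, and only the rows that actually change are touched.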
** Some examples, from Q&A:
- Does ifelse really calculate both of its vectors every time? Is it slow?
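A minimal sketch of the cost the linked question asks about, using a hypothetical helper expensive_upper() that counts how many elements it processes. When the test contains both TRUE and FALSE values, ifelse() evaluates its full yes and no vectors, whereas subset assignment only computes values for the rows that change.

```r
# Counter recording how many elements the (hypothetical) helper processes
n_processed <- 0L
expensive_upper <- function(x) {
  n_processed <<- n_processed + length(x)
  toupper(x)
}

x    <- c("a", "b", "c", "d")
test <- c(TRUE, FALSE, FALSE, FALSE)

# ifelse() builds the whole `yes` vector even though only one element is used
res_ifelse <- ifelse(test, expensive_upper(x), x)
n_after_ifelse <- n_processed

# Subset assignment computes new values only for the rows that need them
n_processed <- 0L
res_subset <- x
res_subset[test] <- expensive_upper(x[test])
n_after_subset <- n_processed

c(ifelse = n_after_ifelse, subset = n_after_subset)
```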