Concatenate rows of a column based on values of another column

I have data in this format:

       df <- data.frame(seqpart=factor(c("", "ccagttgttg", "tttgattcg", "ctttgtc","", "ctttgtcga","cttagta", "ttactgt", "ttacat")), 
       seqinfo= factor(c("IDseq1|specie1", "", "","","IDseq2|specie2","","","","")))

 > df
   seqpart         seqinfo
   <NA>            IDseq1|specie1
   ccagttgttg      <NA>
   tttgattcg       <NA>
   ctttgtc         <NA>
   <NA>            IDseq2|specie2
   ctttgtcga       <NA>
   cttagta         <NA>
   ttactgt         <NA>
   ttacat          <NA>

I want to concatenate the rows according to the seqinfo column, to build another data frame in this new format:

>df1    
 seqinfo             seq
 IDseq1|specie1      ccagttgttgtttgattcgctttgtc
 IDseq2|specie2      ctttgtcgacttagtattactgtttacat

Is there a way to do this? Many thanks.

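We create a grouping variable ('grp') based on the presence of non-blank elements in 'seqinfo', take the non-blank element from 'seqinfo' in each group, and paste the 'seqpart' values together.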

library(data.table)
setDT(df)[, .(seqinfo = seqinfo[seqinfo!=''], 
  seqpart = paste(seqpart, collapse='')),.(grp = cumsum(seqinfo !=""))][, grp := NULL][]
#          seqinfo                       seqpart
#1: IDseq1|specie1    ccagttgttgtttgattcgctttgtc
#2: IDseq2|specie2 ctttgtcgacttagtattactgtttacat
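
For readers without data.table, the same grouping idea can be sketched in base R. This is only an illustration, assuming the df from the question; the names grp, seq_by_grp and res are made up for this example.

```r
# Example data from the question
df <- data.frame(
  seqpart = factor(c("", "ccagttgttg", "tttgattcg", "ctttgtc", "",
                     "ctttgtcga", "cttagta", "ttactgt", "ttacat")),
  seqinfo = factor(c("IDseq1|specie1", "", "", "",
                     "IDseq2|specie2", "", "", "", ""))
)

# A new group starts at every non-empty seqinfo entry
grp <- cumsum(df$seqinfo != "")

# Paste the fragments of each group into a single string
seq_by_grp <- tapply(as.character(df$seqpart), grp, paste, collapse = "")

# Combine the header of each group with its concatenated sequence
res <- data.frame(
  seqinfo = as.character(df$seqinfo[df$seqinfo != ""]),
  seq     = unname(as.vector(seq_by_grp)),
  stringsAsFactors = FALSE
)
res
```

The empty seqpart on each header row contributes nothing to the pasted string, so it does not need to be filtered out first.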

Another idea, from the tidyverse. We first replace '' with NA and fill. Then we group by seqinfo and paste the unique seqparts together, i.e.

library(tidyverse)

df %>% 
 mutate_all(funs(replace(., . == '', NA))) %>% 
 fill(seqpart, .direction = 'up') %>% 
 fill(seqinfo) %>% 
 group_by(seqinfo) %>% 
 summarise(seqpart = paste(unique(seqpart), collapse = ''))
# A tibble: 2 x 2
         seqinfo                       seqpart
          <fctr>                         <chr>
1 IDseq1|specie1    ccagttgttgtttgattcgctttgtc
2 IDseq2|specie2 ctttgtcgacttagtattactgtttacat
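
One caveat: unique() would also drop a fragment that legitimately occurs twice within the same sequence. A possible variant that avoids this is sketched below. It assumes dplyr >= 1.0.0 (for across()) and tidyr are installed, and it drops the header rows explicitly instead of relying on unique(). It also avoids funs(), which has been deprecated since dplyr 0.8.0. The name res is made up for this example.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  seqpart = factor(c("", "ccagttgttg", "tttgattcg", "ctttgtc", "",
                     "ctttgtcga", "cttagta", "ttactgt", "ttacat")),
  seqinfo = factor(c("IDseq1|specie1", "", "", "",
                     "IDseq2|specie2", "", "", "", ""))
)

res <- df %>%
  # Convert to character and turn empty strings into NA
  mutate(across(everything(), ~ na_if(as.character(.x), ""))) %>%
  # Carry each header down over its fragment rows
  fill(seqinfo) %>%
  # Drop the header rows, which carry no sequence fragment
  filter(!is.na(seqpart)) %>%
  group_by(seqinfo) %>%
  summarise(seq = paste(seqpart, collapse = ""))
res
```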

Here is an alternative data.table solution that uses na.locf() from the zoo package (last observation carried forward):

library(data.table)
data.table(df)[, seqinfo := zoo::na.locf(droplevels(seqinfo, ""))][
  , .(seq = paste(seqpart, collapse = "")), by = seqinfo]
          seqinfo                           seq
1: IDseq1|specie1    ccagttgttgtttgattcgctttgtc
2: IDseq2|specie2 ctttgtcgacttagtattactgtttacat
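
To make the carry-forward step concrete, here is a small base R sketch of what "last observation carried forward" does on a toy vector. It is not how zoo implements na.locf(); the names x, idx and filled are invented for illustration, and the sketch assumes the first element is not missing.

```r
# Toy vector: headers followed by missing entries
x <- c("IDseq1|specie1", NA, NA, "IDseq2|specie2", NA)

# Running count of non-missing values: 1 1 1 2 2
idx <- cumsum(!is.na(x))

# Look up the most recent non-missing value for every position
filled <- x[!is.na(x)][idx]
filled
```

After this step every row carries its header, so grouping by the filled column is enough to collect the fragments of each sequence.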

Data

df <- data.frame(
  seqpart=factor(c("", "ccagttgttg", "tttgattcg", "ctttgtc", "", "ctttgtcga",
                   "cttagta", "ttactgt", "ttacat")), 
  seqinfo= factor(c("IDseq1|specie1", "", "", "", "IDseq2|specie2", "", "", "", "")))

Variant with NA

If the empty entries are coded as NA instead of "", the call to droplevels() can be skipped:

df1 <- fread(
"   seqpart         seqinfo
   <NA>            IDseq1|specie1
  ccagttgttg      <NA>
  tttgattcg       <NA>
  ctttgtc         <NA>
  <NA>            IDseq2|specie2
  ctttgtcga       <NA>
  cttagta         <NA>
  ttactgt         <NA>
  ttacat          <NA>",
  na.strings = "<NA>"
)

data.table(df1)[, seqinfo := zoo::na.locf(seqinfo)][
  , .(seq = paste(seqpart, collapse = "")), by = seqinfo]