Create a dataframe from a file with different delimiters

I need to build a table from a file that uses several different delimiters and keywords. I have the following file:

    >>>ENST00000370225_4
    >7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     7E7O_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
    Length=2317
    
     Score = 4711 bits (12220),  Expect = 0.0, Method: Compositional matrix adjust.
     Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
    
    >NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo 
    sapiens]
     P78363.3 RecName: Full=Retinal-specific phospholipid-transporting ATPase 
    ABCA4; AltName: Full=ATP-binding cassette sub-family A member 
    4; AltName: Full=RIM ABC transporter; Short=RIM proteinv; 
    Short=RmP; AltName: Full=Retinal-specific ATP-binding cassette 
    transporter; AltName: Full=Stargardt disease protein [Homo 
    sapiens]
     7LKP_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     7M1P_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     7M1Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 
    [Homo sapiens]
     EAW73056.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
    CRA_a [Homo sapiens]
    Length=2273
    
     Score = 4711 bits (12219),  Expect = 0.0, Method: Compositional matrix adjust.
     Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)
    
    >>>ENST00000460514_1
    >CAH10486.1 hypothetical protein [Homo sapiens]
    Length=1065
    
     Score = 301 bits (772),  Expect = 2e-96, Method: Compositional matrix adjust.
     Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)
    
    >CAA75729.1 ABCR [Homo sapiens]
    Length=2273
    
     Score = 300 bits (769),  Expect = 2e-94, Method: Compositional matrix adjust.
     Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)

The desired output is:

  Transcript        Protein                                                                 Length Score             Identity     Percent
1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4   2317 4711 bits (12220) 2273/2273       100%
2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4       2273 4711 bits (12219) 2273/2273       100%
3 ENST00000460514_1 CAH10486.1 hypothetical protein                                           1065 301 bits (772)    146/146         100%
4 ENST00000460514_1 CAA75729.1 ABCR                                                           2273 300 bits (769)    146/146         100%

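Each required column is marked in the original file by a keyword or delimiter: Length, Identities, Score, ">" and ">>>". I tried the following script, but I am missing the transcript (the first column), which is the value introduced by ">>>".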

my_txt <- readLines(con = "gene_filt_perc.txt")

# Backslashes must be doubled inside R string literals ("\\s", not "\s");
# a comma needs no escaping at all.
transcript <- my_txt[grepl("^\\s*>>>", my_txt)]
lengths <- my_txt[grepl("^\\s*Length", my_txt)]
lengths <- gsub("Length=", "", lengths)
scores <- my_txt[grepl("^\\s*Score", my_txt)]
scores <- gsub(" Score = ", "", scores)
scores <- gsub(",  Expect = ..*", "", scores)
identities <- my_txt[grepl("^\\s*Identities", my_txt)]
identities <- gsub(" Identities = ", "", identities)
identities <- gsub(", Positives = ..*", "", identities)
protein <- my_txt[grepl("^\\s*>[[:alnum:]]", my_txt)]

result <- data.frame("protein" = protein, "identities" = identities, "scores" = scores, "lengths" = lengths)
result
                                                                          protein       identities            scores lengths
1        7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4  2273/2273 (100%) 4711 bits (12220)    2317
2        7LKZ_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4   2271/2273 (99%) 4711 bits (12220)    2273
3      NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo  2273/2273 (100%) 4711 bits (12219)    2273
4                                 BAE06122.2 ABCA4 variant protein [Homo sapiens]  2272/2273 (99%) 4710 bits (12218)    2273
5        7E7Q_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4   2271/2273 (99%) 4709 bits (12214)    2317

Is there a simpler way to build the data.frame?

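Running gsub over the whole file at once destroys the hierarchical nesting (transcript, then protein hits, then fields), so keep the data together. Start with the file read as one big string (stored here in data.txt) and split it into one element per transcript, then split each transcript into its protein hits. The individual fields can be extracted with regular expressions. unnest turns the nested list into multiple rows of the table, which automatically repeats the transcript ID for each of its proteins.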

library(tidyverse)

read_file("data.txt") %>%
  str_split(">>>") %>%
  simplify() %>%
  discard(~ .x == "") %>%
  # parse all transcripts
  map(function(Transcript) {
    list(
      Transcript = Transcript %>% str_extract("ENST.*"),
      Protein = {
        Transcript %>%
        str_split(">") %>%
        simplify() %>%
        discard(~ .x %>% str_detect("^ENST"))
      }
    )
  }) %>%
  enframe() %>%
  select(value) %>%
  unnest_wider(value) %>%
  unnest(Protein) %>%
  mutate(
    # parse all proteins
    Protein = Protein %>% map(function(Protein) {
      list(
        # first line
        Protein = Protein %>% str_split("\n") %>% simplify() %>% first() %>% str_trim(),
        # numbers after pattern 'Length='
        Length = Protein %>% str_extract("(?<=Length=)[0-9 ]+") %>% as.numeric(),
        # numbers after pattern 'Score = '
        Score = Protein %>% str_extract("(?<=Score = )[0-9 ]+") %>% as.numeric()
      )
    })
  ) %>%
  unnest_wider(Protein)

#>  Transcript        Protein                                                                   Length Score
#>  <chr>             <chr>                                                                      <dbl> <dbl>
#>1 ENST00000370225_4 7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4     2317  4711
#>2 ENST00000370225_4 NP_000341.2 retinal-specific phospholipid-transporting ATPase ABCA4 [Homo   2273  4711
#>3 ENST00000460514_1 CAH10486.1 hypothetical protein [Homo sapiens]                              1065   301
#>4 ENST00000460514_1 CAA75729.1 ABCR [Homo sapiens]                                              2273   300
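
The output above stops at Length and Score. As a complement, here is a minimal base R sketch that stays closer to the script in the question and also fills in the Transcript, Identity and Percent columns. The helper name `parse_hits` and the inlined `sample_lines` vector are made up for illustration. The sketch assumes that the first line of the input is a ">>>" line, that lines are not indented, and that every hit has exactly one Length, Score and Identities line; it is not a general BLAST parser.

```r
# Hypothetical helper (not from the question or the answer): parse the text
# lines into one row per hit, carrying the transcript ID down to each hit.
parse_hits <- function(lines) {
  is_transcript <- grepl("^>>>", lines)  # ">>>ENST..." opens a transcript block
  is_hit <- grepl("^>[^>]", lines)       # a single ">" opens a protein hit

  # cumsum() numbers the transcript blocks line by line; indexing the ID
  # vector with that number gives the transcript each line belongs to.
  # Assumption: the first line of the input is a ">>>" line.
  transcript_ids <- trimws(sub("^>>>", "", lines[is_transcript]))
  transcript_of_line <- transcript_ids[cumsum(is_transcript)]

  # Assumption: every hit has exactly one Length, Score and Identities line.
  length_lines <- grep("^\\s*Length=", lines, value = TRUE)
  score_lines <- grep("^\\s*Score = ", lines, value = TRUE)
  identity_lines <- grep("^\\s*Identities = ", lines, value = TRUE)
  identity_pattern <- "^\\s*Identities = ([0-9]+/[0-9]+) \\(([0-9]+%)\\).*$"

  data.frame(
    Transcript = transcript_of_line[is_hit],
    # Drop the leading ">" and the (possibly wrapped) "[Homo sapiens]" tag.
    Protein = trimws(sub("\\[.*$", "", sub("^>", "", lines[is_hit]))),
    Length = as.integer(sub("^\\s*Length=", "", length_lines)),
    Score = sub("^\\s*Score = ([^,]*),.*$", "\\1", score_lines),
    Identity = sub(identity_pattern, "\\1", identity_lines),
    Percent = sub(identity_pattern, "\\2", identity_lines),
    stringsAsFactors = FALSE
  )
}

# A few lines copied from the file in the question, inlined so the sketch runs
# on its own; with the real file, pass readLines("gene_filt_perc.txt") instead.
sample_lines <- c(
  ">>>ENST00000370225_4",
  ">7E7I_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 ",
  "[Homo sapiens]",
  " 7E7O_A Chain A, Retinal-specific phospholipid-transporting ATPase ABCA4 ",
  "[Homo sapiens]",
  "Length=2317",
  "",
  " Score = 4711 bits (12220),  Expect = 0.0, Method: Compositional matrix adjust.",
  " Identities = 2273/2273 (100%), Positives = 2273/2273 (100%), Gaps = 0/2273 (0%)",
  "",
  ">>>ENST00000460514_1",
  ">CAH10486.1 hypothetical protein [Homo sapiens]",
  "Length=1065",
  "",
  " Score = 301 bits (772),  Expect = 2e-96, Method: Compositional matrix adjust.",
  " Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)",
  "",
  ">CAA75729.1 ABCR [Homo sapiens]",
  "Length=2273",
  "",
  " Score = 300 bits (769),  Expect = 2e-94, Method: Compositional matrix adjust.",
  " Identities = 146/146 (100%), Positives = 146/146 (100%), Gaps = 0/146 (0%)"
)

hits <- parse_hits(sample_lines)
print(hits)
```

The trade-off against the tidyverse pipeline is that this version relies on the Length, Score and Identities lines appearing once per hit and in the same order as the ">" lines. If a hit can lack one of them, the columns end up with different lengths and data.frame() stops with an error, whereas the nested approach keeps each hit's fields together.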