readr: best practice for reading files where delimiters need to be merged?

Sometimes people pad a file with repeated spaces to make it easier to read. However, readr's read_delim does not seem able to handle this use case.

Example output from PLINK:

 FID          IID  PHENO    CNT   CNT2    SCORE
   0   ERR1136327     -9   2000    417 -0.000263553
   0   ERR1136328     -9   2808    755 -0.000119435
   0   ERR1136329     -9   1026    242 8.63494e-05
   0   ERR1136330     -9   2688    880 0.000517726
   0   ERR1136331     -9   1868    567 0.000264016
   0   ERR1136332     -9   3522   1368 0.000144985

(the first few lines)

Trying to read it with read_delim:

> d = read_delim("data/no_vcf_filtering/plink.profile", delim = " ")
Missing column names filled in: 'X1' [1], 'X3' [3], 'X4' [4], 'X5' [5], 'X6' [6], 'X7' [7], 'X8' [8], 'X9' [9], 'X10' [10], 'X11' [11], 'X13' [13], 'X15' [15], 'X16' [16], 'X17' [17], 'X19' [19], 'X20' [20], 'X22' [22], 'X23' [23], 'X24' [24]
Parsed with column specification:
cols(
  .default = col_character(),
  X4 = col_integer(),
  IID = col_integer(),
  X15 = col_integer(),
  X16 = col_integer(),
  X17 = col_integer(),
  CNT = col_integer(),
  X19 = col_double(),
  X20 = col_double(),
  CNT2 = col_double(),
  X22 = col_double(),
  X23 = col_double(),
  X24 = col_double(),
  SCORE = col_double()
)
See spec(...) for full column specifications.
number of columns of result is not a multiple of vector length (arg 1)
215 parsing failures.
  row col   expected   actual     file
    1 <NA>  25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
    2 <NA>  25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
    3 <NA>  25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
    4 <NA>  25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
    5 <NA>  25 columns 20 columns 'data/no_vcf_filtering/plink.profile'
  ...
See problems(...) for more details.

The obvious fix, passing a regex-style delimiter, does not work either:

d = read_delim("data/no_vcf_filtering/plink.profile", delim = " +")
Parsed with column specification:
cols(
  .default = col_character(),
  X4 = col_integer(),
  IID = col_integer(),
  X15 = col_integer(),
  X16 = col_integer(),
  X17 = col_integer(),
#etc.

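I found a roundabout solution: convert each run of spaces to a tab, join the lines with newlines, then read the result as TSV (dropping the empty first column in this case). It should not be this hard, though. Am I missing something obvious?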

> read_lines("data/no_vcf_filtering/plink.profile") %>% str_replace_all(" +", "\t") %>% str_c(collapse = "\n") %>% read_tsv() %>% .[, -1]
# A tibble: 230 x 6
     FID        IID PHENO   CNT  CNT2        SCORE
   <int>      <chr> <int> <int> <int>        <dbl>
 1     0 ERR1136327    -9  2000   417 -2.63553e-04
 2     0 ERR1136328    -9  2808   755 -1.19435e-04
 3     0 ERR1136329    -9  1026   242  8.63494e-05
 4     0 ERR1136330    -9  2688   880  5.17726e-04
 5     0 ERR1136331    -9  1868   567  2.64016e-04
 6     0 ERR1136332    -9  3522  1368  1.44985e-04
 7     0 ERR1136333    -9   870   110 -1.25087e-04
 8     0 ERR1136334    -9  2936   877 -6.35191e-04
 9     0 ERR1136335    -9  3048   914 -2.22427e-06
10     0 ERR1136336    -9  3184   814  2.77346e-04
# ... with 220 more rows
Warning message:
Missing column names filled in: 'X1' [1]

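readr::read_table is the right function for this format. It is designed for text data in which columns are separated by one or more spaces.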

> read_table("test.txt")
Parsed with column specification:
cols(
  FID = col_integer(),
  IID = col_character(),
  PHENO = col_integer(),
  CNT = col_integer(),
  CNT2 = col_integer(),
  SCORE = col_double()
)
# A tibble: 6 x 6
    FID        IID PHENO   CNT  CNT2        SCORE
  <int>      <chr> <int> <int> <int>        <dbl>
1     0 ERR1136327    -9  2000   417 -2.63553e-04
2     0 ERR1136328    -9  2808   755 -1.19435e-04
3     0 ERR1136329    -9  1026   242  8.63494e-05
4     0 ERR1136330    -9  2688   880  5.17726e-04
5     0 ERR1136331    -9  1868   567  2.64016e-04
6     0 ERR1136332    -9  3522  1368  1.44985e-04
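
For anyone who wants to reproduce this without the original file, here is a minimal sketch. It assumes the six sample rows from the question are written to a temporary file (the path is hypothetical), and it passes an explicit col_types so the result does not depend on type guessing. Exact behaviour may differ slightly between readr versions.

```r
library(readr)

# Sample rows copied from the question; runs of spaces pad the columns.
sample_lines <- c(
  " FID          IID  PHENO    CNT   CNT2    SCORE",
  "   0   ERR1136327     -9   2000    417 -0.000263553",
  "   0   ERR1136328     -9   2808    755 -0.000119435",
  "   0   ERR1136329     -9   1026    242 8.63494e-05",
  "   0   ERR1136330     -9   2688    880 0.000517726",
  "   0   ERR1136331     -9   1868    567 0.000264016",
  "   0   ERR1136332     -9   3522   1368 0.000144985"
)

# Hypothetical temporary file standing in for plink.profile.
path <- tempfile(fileext = ".profile")
writeLines(sample_lines, path)

# Spell out the column types instead of relying on guessing,
# so that IID stays character and the counts stay integer.
d <- read_table(
  path,
  col_types = cols(
    FID   = col_integer(),
    IID   = col_character(),
    PHENO = col_integer(),
    CNT   = col_integer(),
    CNT2  = col_integer(),
    SCORE = col_double()
  )
)

print(d)
```

Giving col_types explicitly also silences the "Parsed with column specification" message, which is convenient in scripts.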

The same goes for the base:: functions: read.table vs. read.delim.
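
A minimal sketch of the base R equivalent, using a few of the sample rows from the question passed in through the text argument rather than a file. With its default sep = "", read.table treats any run of whitespace as one separator and ignores leading spaces, whereas read.delim defaults to a tab separator.

```r
# A few sample rows copied from the question.
sample_lines <- c(
  " FID          IID  PHENO    CNT   CNT2    SCORE",
  "   0   ERR1136327     -9   2000    417 -0.000263553",
  "   0   ERR1136328     -9   2808    755 -0.000119435",
  "   0   ERR1136329     -9   1026    242 8.63494e-05"
)

# header = TRUE takes the column names from the first line;
# stringsAsFactors = FALSE keeps IID as character on older R versions.
d_base <- read.table(
  text = sample_lines,
  header = TRUE,
  stringsAsFactors = FALSE
)

str(d_base)
```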