Skip some lines with fread

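I want to skip some lines in my data file that come before the header row. How can I do this with fread, skipping all lines before ID_REF or, if ID_REF is not present, checking for the pattern ILMN_ and dropping every line before the first match?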

# GEOarchive matrix file.               
ID_REF  1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS 1688628068_A.BEAD_STDERR 1688628068_A.Detection Pval
ILMN_1343291    62821.84         135                               413.9399                       0
ILMN_1343292    3255.167         131                               47.76587                       0
ILMN_1343293    42924.91         152                               539.3026                       0
ILMN_1343294    55255.21         100                               746.1457                       0

On Linux, you can use awk together with fread, or with read.table. Here, I use awk to keep everything from the header line onward and to change the delimiter to a comma:
pth <- '/home/akrun/file.txt' #change it to your path
v1 <- sprintf("awk '/^(ID_REF|ILMN)/{ matched = 1} matched {$1=$1; print}' OFS=\",\" %s", pth)

Then read it with fread:

library(data.table)
fread(cmd = v1)  # cmd= tells fread that v1 is a shell command; older data.table versions take fread(v1)
#         ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
#1: ILMN_1343291               62821.840                     135
#2: ILMN_1343292                3255.167                     131
#3: ILMN_1343293               42924.910                     152
#4: ILMN_1343294               55255.210                     100
#   1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval
#1:                413.93990                           0
#2:                 47.76587                           0
#3:                539.30260                           0
#4:                746.14570                           0

Or use read.table:

read.table(pipe(v1), header=TRUE, sep=',', check.names=FALSE)
#       ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
#1 ILMN_1343291               62821.840                     135
#2 ILMN_1343292                3255.167                     131
#3 ILMN_1343293               42924.910                     152
#4 ILMN_1343294               55255.210                     100
#  1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval
#1                413.93990                           0
#2                 47.76587                           0
#3                539.30260                           0
#4                746.14570                           0
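
To see what the awk command hands to R, it can be run on its own in a shell. The following is only a minimal sketch: the temporary file, its shortened column names, and the variable names `sample` and `csv` are made up for illustration. It shows that once a line starting with ID_REF or ILMN is seen, every line from there on is printed, and that the `$1=$1` assignment is what makes awk rebuild each record with the comma set in OFS.

```shell
# Hypothetical sample file; contents are shortened for illustration only.
sample=$(mktemp)
printf '%s\n' \
  '# GEOarchive matrix file.' \
  'ID_REF  A.AVG_Signal A.Avg_NBEADS' \
  'ILMN_1343291    62821.84         135' \
  'ILMN_1343292    3255.167         131' > "$sample"

# Set the flag at the first line starting with ID_REF or ILMN.
# From then on, $1=$1 forces awk to rebuild the record, joining fields with OFS.
csv=$(awk '/^(ID_REF|ILMN)/{ matched = 1 } matched { $1 = $1; print }' OFS="," "$sample")

printf '%s\n' "$csv"
rm -f "$sample"
```

Without the `$1=$1` assignment, awk prints each line unchanged and the OFS setting has no visible effect.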

Note: I changed the column name from 1688628068_A.Detection Pval to 1688628068_A.Detection_Pval.

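For some reason, the extra spaces caused problems for fread. They are not a problem for read.table, so the following, which leaves the delimiter unchanged, also works with read.table: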

 v2 <- sprintf("awk '/^(ID_REF|ILMN)/{ matched = 1} matched { print}' %s", pth)

 read.table(pipe(v2), header=TRUE, check.names=FALSE)
 #       ID_REF 1688628068_A.AVG_Signal 1688628068_A.Avg_NBEADS
 #1 ILMN_1343291               62821.840                     135
 #2 ILMN_1343292                3255.167                     131
 #3 ILMN_1343293               42924.910                     152
 #4 ILMN_1343294               55255.210                     100
 #  1688628068_A.BEAD_STDERR 1688628068_A.Detection_Pval
 #1                413.93990                           0
 #2                 47.76587                           0
 #3                539.30260                           0
 #4                746.14570                           0
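
The question also asks about files where ID_REF is missing. Here is a rough sketch of how the same awk pattern behaves in that case, run directly in a shell. The file contents and the names `nohdr` and `kept` are hypothetical. Because the pattern matches either ID_REF or ILMN at the start of a line, the output begins at the first ILMN_ row.

```shell
# Hypothetical file that has comment lines but no ID_REF header.
nohdr=$(mktemp)
printf '%s\n' \
  '# GEOarchive matrix file.' \
  '# another comment line' \
  'ILMN_1343291    62821.84         135' \
  'ILMN_1343292    3255.167         131' > "$nohdr"

# Same filter as v2: start printing at the first line that begins with ID_REF or ILMN.
kept=$(awk '/^(ID_REF|ILMN)/{ matched = 1 } matched { print }' "$nohdr")

printf '%s\n' "$kept"
rm -f "$nohdr"
```

In this situation the output has no header line, so on the R side read.table would need header=FALSE; with header=TRUE the first data row would be taken as column names.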