Extract values from comments (#) in R
Thanks in advance for any help.
I have a data frame read from a .tsv file that looks like this:
##ID_value=1829
##exportDate=1-18-2019
ChemID BasedMaterial State
MSO11D Oxygen Gas
GSX55E Carbon Liquid
What I want to do is add a new column called ID, filled with the ID_value from the ## comments, so that I get something like this:
ID ChemID BasedMaterial State
1829 MSO11D Oxygen Gas
1829 GSX55E Carbon Liquid
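The problem is that when I import the .tsv, I lose all the comment values. That is fine in itself, since I don't actually want them in my output file (an Excel table), but it also means I lose information that is useful for disclosure purposes.
So, is there a way to use the values from the ## comments to create that column, and drop the comments when building the table? Thanks a lot.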
You can try the function below; the explanation is in the comments.
func = function(FILE, COMMENTCHAR, VALUE) {
  allLines = readLines(FILE)
  # exclude lines with comments
  # and make table
  tab = read.table(text = allLines[!grepl(COMMENTCHAR, allLines)], header = TRUE)
  # find the line which has the value in comments
  value = allLines[grepl(VALUE, allLines) & grepl(COMMENTCHAR, allLines)]
  # we split to get the name and value
  value = unlist(strsplit(gsub("#", "", value), "="))
  df = data.frame(value[2], tab)
  colnames(df)[1] = value[1]
  return(df)
}
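The main idea is to read everything with readLines. The lines without comments are converted to a table. Among the comment lines, we search for the value you want and make it the first column. Trying it on your text file: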
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "ChemID BasedMaterial State",
"MSO11D Oxygen Gas", "GSX55E Carbon Liquid"
)
writeLines(text,"test.txt")
func("test.txt","#","ID")
ID_value ChemID BasedMaterial State
1 1829 MSO11D Oxygen Gas
2 1829 GSX55E Carbon Liquid
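The question asked for a column named ID rather than ID_value. A minimal sketch of the same readLines idea that names the column directly and stores the value as an integer is shown below. The variable names (path, is_meta, id_line, result) and the use of a temporary file are assumptions for illustration, not part of the answer above.

```r
# Sample lines taken from the question; a temporary file stands in for the real .tsv
lines <- c("##ID_value=1829", "##exportDate=1-18-2019",
           "ChemID BasedMaterial State",
           "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

all_lines <- readLines(path)

# Comment (meta) lines are the ones that start with '#'
is_meta <- grepl("^#", all_lines)

# Pick the comment line carrying ID_value and keep what follows the '='
id_line <- all_lines[is_meta & grepl("ID_value=", all_lines, fixed = TRUE)]
id <- as.integer(sub(".*=\\s*", "", id_line))

# Build the table from the remaining lines and prepend the ID column
tab <- read.table(text = all_lines[!is_meta], header = TRUE)
result <- data.frame(ID = id, tab)
print(result)
```

Anchoring the pattern with ^ means a # inside a data row would not be mistaken for a comment line, which the unanchored grepl in func would do.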
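While StupidWolf's answer works, I think it is generally not a good idea to replace the proven file handling of read.table with your own text parsing: it comes with a penalty as the file gets larger (anecdotally, a 20% increase at 100k rows, and more for larger files).
If the pattern is known to be at the top, read in the top few lines, find the relevant portions, and then call read.table on the original file (with the original arguments).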
#' @param file 'character', the name of the file which the data are to be read from
#' @param ... other arguments passed to 'read.table'
#' @param meta_char 'character', the string (or pattern) that indicates a 'key=val' or 'note'
#' @param meta_rows 'integer', maximum number of rows to look for meta
#' @param meta_unnamed 'character', used for column-header of meta when no '=' is found
#' @param meta_skip_more 'integer', number of lines beyond the meta rows to skip for real data
#' @return 'data.frame', with any meta data augmented as columns
read_table_with_meta <- function(file, ...,
                                 meta_char = "#", meta_rows = 10L, meta_unnamed = "meta",
                                 meta_skip_more = 0L) {
  # read only the top of the file to look for meta lines
  toplines <- readLines(file, n = meta_rows)
  meta_ptn <- paste0("^", meta_char)
  dots <- list(...)
  if ("skip" %in% names(dots)) {
    warning("'skip' is determined by 'read_table_with_meta' and should not be assigned; if you need to skip more rows after meta rows, then use 'meta_skip_more'; 'skip=' is ignored here")
    dots$skip <- NULL
  }
  if (all(grepl(meta_ptn, toplines))) {
    stop("all lines looked like header rows, suggest you increase 'meta_rows'")
  }
  toplines <- toplines[ grepl(meta_ptn, toplines) ]
  skip <- length(toplines) + meta_skip_more
  # strip the leading meta characters; note the doubled backslash in R strings
  toplines <- gsub(paste0("^", meta_char, "+\\s*"), "", toplines)
  if (length(toplines)) {
    keys <- gsub("\\s*=.*", "", toplines)
    vals <- gsub("^[^=]*\\s*=\\s*", "", toplines)
    # a line with no '=' comes out identical as key and value
    unnamed <- (keys == vals)
    keys[unnamed] <- paste0(meta_unnamed, seq_along(keys[unnamed]))
    keyvals <- setNames(as.list(vals), keys)
  } else keyvals <- NULL
  # let read.table do the real parsing, skipping the meta lines
  dat <- do.call("read.table", c(list(file, skip = skip), dots))
  if (is.null(keyvals)) dat else cbind(dat, keyvals)
}
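Notes:
This only searches the first 10 lines (by default), on the premise that you shouldn't try to parse the whole file once non-commented-out lines have been found; if you have comments mid-file, this answer is insufficient.
This function assigns all of those lines to fields; that may not be the most general way to handle this, but I think it addresses your request. After reading in, you can discard the fields you don't need.
The unnamed portion guards against commented-out headers that don't all contain an =; it's just a trick, and I'm not sure it is necessary or useful for you.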
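As a rough illustration of discarding unwanted fields after reading, the sketch below drops exportDate and renames ID_value to ID, which is the layout the question asked for. The data frame dat is built by hand here as a stand-in for what read_table_with_meta returns in the simple case, so treat it as an assumption rather than actual function output.

```r
# Hand-built stand-in for the data frame returned in the "simple case"
dat <- data.frame(
  ChemID = c("MSO11D", "GSX55E"),
  BasedMaterial = c("Oxygen", "Carbon"),
  State = c("Gas", "Liquid"),
  ID_value = "1829",
  exportDate = "1-18-2019"
)

# Drop the metadata column that is not needed in the output
dat$exportDate <- NULL

# Rename ID_value to ID and move it to the front
names(dat)[names(dat) == "ID_value"] <- "ID"
dat <- dat[, c("ID", setdiff(names(dat), "ID"))]
print(dat)
```

Selecting columns by name rather than by position keeps this working if the file gains more meta lines later.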
Demonstration:
### safe with no-meta files
text=c("ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State
# 1 MSO11D Oxygen Gas
# 2 GSX55E Carbon Liquid
### simple case
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State ID_value exportDate
# 1 MSO11D Oxygen Gas 1829 1-18-2019
# 2 GSX55E Carbon Liquid 1829 1-18-2019
### unnamed meta
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "##somethingelse", "ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State ID_value exportDate meta1
# 1 MSO11D Oxygen Gas 1829 1-18-2019 somethingelse
# 2 GSX55E Carbon Liquid 1829 1-18-2019 somethingelse
### multiple unnamed meta
text=c("##ID_value=1829", "##exportDate=1-18-2019 ", "##somethingelse", "##key=val", "##more", "ChemID BasedMaterial State", "MSO11D Oxygen Gas", "GSX55E Carbon Liquid")
writeLines(text, "test.txt")
read_table_with_meta("test.txt", header = TRUE)
# ChemID BasedMaterial State ID_value exportDate meta1 key meta2
# 1 MSO11D Oxygen Gas 1829 1-18-2019 somethingelse val more
# 2 GSX55E Carbon Liquid 1829 1-18-2019 somethingelse val more