How to tidy a fixed width file with headers every n (varies) rows?
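I have time-series data in a fixed-width file, where observation rows (n varies with sample size) appear under "header" rows that contain important metadata (i.e. sample number, date, etc.). Both types of rows contain alphanumeric characters. It looks like this (strings shortened for readability):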
4 64001416230519844TP blahblah
5416001130 1 F 492273
5416001140 3 F 492274
5416001145 1 F 492275
5416001150 19 F 492276
5416001155 21 F 492277
5416001160 21 F 492278
5416001165 13 F 492279
5416001170 3 F 492280
5416001180 1 F 492281
4 64001544250619844RA blahblah
5544001125 1 F 492291
5544001130 3 F 492292
5544001135 4 F 492293
5544001140 11 F 492294
5544001145 13 F 492295
4 64002544250619844RA blahblah
etc.
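Header rows are distinguished by the first character of the string being 4 and are 89 characters long. Observation rows start with 5 and are 24 characters long.
What I would like is to paste the header row onto each subsequent observation row (its subset of the data), so that I can later parse the strings with read_fwf and be sure I can sort each observation by the information contained in the header row. I don't care whether the original header rows are dropped. Like this: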
5416001130 1 F 492273 4 64001416230519844TP blahblah
5416001140 3 F 492274 4 64001416230519844TP blahblah
5416001145 1 F 492275 4 64001416230519844TP blahblah
5416001150 19 F 492276 4 64001416230519844TP blahblah
5416001155 21 F 492277 4 64001416230519844TP blahblah
5416001160 21 F 492278 4 64001416230519844TP blahblah
5416001165 13 F 492279 4 64001416230519844TP blahblah
5416001170 3 F 492280 4 64001416230519844TP blahblah
5416001180 1 F 492281 4 64001416230519844TP blahblah
5544001125 1 F 492291 4 64001544250619844RA blahblah
5544001130 3 F 492292 4 64001544250619844RA blahblah
5544001135 4 F 492293 4 64001544250619844RA blahblah
5544001140 11 F 492294 4 64001544250619844RA blahblah
5544001145 13 F 492295 4 64001544250619844RA blahblah
etc...
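The closest solution I found is here: fwf file with headers every 5th row, headers were characters and observations numeric
The solution provided there is a loop that steps through the rows, tests whether each one is character or numeric, and pastes them together accordingly.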
text <- readLines('/path/to/file') # read in the file
split_text <- strsplit(text, "\\s+") # split each line on whitespace
for (line in split_text) { # iterate through lines
numeric_line <- suppressWarnings(as.numeric(line)) # try to convert the current line into a vector of numbers
if (is.na(numeric_line[[1]])) { # if it fails, we know we're on a header line
header <- line
} else {
for (i in seq(1, length(line), 2)) { # otherwise, we're on a data line, so take two numbers at once
print(c(header, line[[i]], line[[i+1]])) # and output the latest header with each pair of values
}
}
}
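I tried to adapt this to my data by first reading the file with read.fwf() or read_fwf() and defining the first character as its own column, to distinguish headers from observations: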
packages = c('tidyverse','rgdal','car','audio','beepr','xlsx','magrittr','lubridate','RColorBrewer','haven')
invisible(lapply(packages, function(x) {if (!require(x, character.only = T)) {install.packages(x);require(x)}}))
DF <- read.fwf("directory/.dat", widths = c(1, 88), header = FALSE)
My adaptation:
newdf <- for (i in DF) { # iterate through lines
if (DF$V1 == 4) { # if true, we know we're on a header row
header <- i
} else {
for (i in seq(1, length(DF$V2), 1)) { # otherwise = observation row
print(c(header, DF$V2[[i]], DF$V2[[i+1]])) # and output the latest header with each observation until you hit another header
}
}
}
#this is very slow and/or does not work
# I get the following error message
#Warning messages:
# 1: In if (DF$V1 == 4) { :
#   the condition has length > 1 and only the first element will be used
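I also tried telling headers and observation rows apart with nchar() (headers = 89, observations = 24).
I realize the fix for the loop here might be to use ifelse, but that raises another problem.
The dataset is about 39,700 rows long and I keep receiving new data. A loop would take a long time...
I would like to do this with data.table or dplyr syntax.
I tried playing with dplyr::lag, following a couple of posts, and got close to what I want: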
newdf<-DF %>%
mutate(new = replace(lag(V2), V1 != '5', NA))
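But as you can see, the new column only gets the information from the immediately preceding row... which is what lag() is supposed to do.
Any help would be greatly appreciated, thanks in advance.
As a side note, these data were previously processed in SAS, but I don't use SAS, so here we are. In case it helps, I do have the SAS code: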
DATA A1;
FILENAME FREQLONG 'dir/FL.DAT';
INFILE FREQLONG;
INPUT
TYPE 1 @ ;
IF TYPE EQ 4 THEN LINK LIGNE4;
IF TYPE EQ 5 THEN DELETE;
RETURN;
LIGNE4:
INPUT var1 $ 6 - 8
var2 $ 9 - 11
var3 12 - 13
var4 14 - 15
var5 18 - 19
var6 $ 20 - 22
var7 $ 44 - 46
var8 $ 78;
DATA A2;
FILENAME FREQLONG 'dir/FL.DAT';
INFILE FREQLONG;
INPUT
TYPE 1 @ ;
IF TYPE EQ 4 THEN DELETE;
IF TYPE EQ 5 THEN LINK LIGNE5;
RETURN;
LIGNE5:
INPUT var1 $ 5 - 7
var2 $ 2 - 4
varz 8 - 10
vara 11 - 13
varb $ 15;
DATA A3;
SET A1;
PROC SORT;
BY var1 var2;
RUN;
DATA A4;
SET A2;
PROC SORT;
BY var1 var2;
RUN;
DATA A5;
MERGE A4 A3;
BY var1 var2;
RUN;
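As you can see, it splits the file, sorts by the variables and then merges the pieces. However, this was done year by year, and I would like to work with a single file covering all years.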
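Here is a solution using the tidyverse.
It creates a new column that holds only the header rows, then fills the rows without a header with the header above them. Finally, if needed, you can paste the columns together.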
x <- read.table(text = "4 64001416230519844TP blahblah
5416001130 1 F 492273
5416001140 3 F 492274
5416001145 1 F 492275
5416001150 19 F 492276
5416001155 21 F 492277
5416001160 21 F 492278
5416001165 13 F 492279
5416001170 3 F 492280
5416001180 1 F 492281
4 64001544250619844RA blahblah
5544001125 1 F 492291
5544001130 3 F 492292
5544001135 4 F 492293
5544001140 11 F 492294
5544001145 13 F 492295", header = FALSE, sep = "\t")
library("tidyverse")
x %>%
rename(body = V1) %>%
mutate(
body = trimws(body),
head = if_else(grepl("^4", body), body, NA_character_),
body = if_else(is.na(head), body, NA_character_)
) %>%
fill(head, .direction = "down") %>%
filter(!is.na(body))
Output:
body head
1 5416001130 1 F 492273 4 64001416230519844TP blahblah
2 5416001140 3 F 492274 4 64001416230519844TP blahblah
3 5416001145 1 F 492275 4 64001416230519844TP blahblah
4 5416001150 19 F 492276 4 64001416230519844TP blahblah
5 5416001155 21 F 492277 4 64001416230519844TP blahblah
6 5416001160 21 F 492278 4 64001416230519844TP blahblah
7 5416001165 13 F 492279 4 64001416230519844TP blahblah
8 5416001170 3 F 492280 4 64001416230519844TP blahblah
9 5416001180 1 F 492281 4 64001416230519844TP blahblah
10 5544001125 1 F 492291 4 64001544250619844RA blahblah
11 5544001130 3 F 492292 4 64001544250619844RA blahblah
12 5544001135 4 F 492293 4 64001544250619844RA blahblah
13 5544001140 11 F 492294 4 64001544250619844RA blahblah
14 5544001145 13 F 492295 4 64001544250619844RA blahblah
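The final paste step mentioned in this answer is not shown in its code. Here is a minimal sketch of it, under two assumptions: the pipeline's result is a data frame with character columns `body` and `head` (the two-row `res` below is a hand-built, hypothetical stand-in), and observation rows should be padded to the 24 characters stated in the question so that column positions stay fixed for a later read_fwf() call. The name `obs_width` is also just illustrative.

```r
# Stand-in for the result of the pipeline above (two rows only).
res <- data.frame(
  body = c("5416001130 1 F 492273", "5544001125 1 F 492291"),
  head = c("4 64001416230519844TP blahblah", "4 64001544250619844RA blahblah"),
  stringsAsFactors = FALSE
)

# Left-justify and pad each observation row to a fixed width
# (24 is the observation width given in the question), then append its header.
obs_width <- 24
res$combined <- paste0(formatC(res$body, width = obs_width, flag = "-"), res$head)

# Every header now starts at the same position (obs_width + 1).
substr(res$combined, obs_width + 1, obs_width + 1)
```

If fixed positions do not matter, `paste(res$body, res$head)` with the default single-space separator is enough.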
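Another possible solution (without the tidyverse) is to read the file line by line, find the header rows and paste each one onto the end of the non-header rows that follow it. After that, the lines are split and put into a data.frame.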
lines <- readLines("asd.dat")
# last index + 1 for iteration
headers <- c(which(grepl("^4 ", lines)), length(lines) + 1)
pastedLines <- c()
for(i in 1:(length(headers) - 1)) {
pastedLines <- c(pastedLines,
paste(lines[(headers[i] + 1) : (headers[i + 1] - 1)], lines[headers[i]]))
}
DF <- as.data.frame(matrix(unlist(strsplit(pastedLines, "\\s+")), nrow = length(pastedLines), byrow=T))
Output:
V1 V2 V3 V4 V5 V6 V7
1 5416001130 1 F 492273 4 64001416230519844TP blahblah
2 5416001140 3 F 492274 4 64001416230519844TP blahblah
3 5416001145 1 F 492275 4 64001416230519844TP blahblah
4 5416001150 19 F 492276 4 64001416230519844TP blahblah
5 5416001155 21 F 492277 4 64001416230519844TP blahblah
6 5416001160 21 F 492278 4 64001416230519844TP blahblah
7 5416001165 13 F 492279 4 64001416230519844TP blahblah
8 5416001170 3 F 492280 4 64001416230519844TP blahblah
9 5416001180 1 F 492281 4 64001416230519844TP blahblah
10 5544001125 1 F 492291 4 64001544250619844RA blahblah
11 5544001130 3 F 492292 4 64001544250619844RA blahblah
12 5544001135 4 F 492293 4 64001544250619844RA blahblah
13 5544001140 11 F 492294 4 64001544250619844RA blahblah
14 5544001145 13 F 492295 4 64001544250619844RA blahblah
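The loop above grows `pastedLines` on every iteration, which may become slow on the roughly 39,700 rows mentioned in the question. Below is a sketch of the same pasting done in one vectorised call. It assumes the first line of the file is a header row; `hdr_idx` and `n_obs` are illustrative names, not part of the answer above.

```r
# Sample lines from the question (shortened).
lines <- c(
  "4 64001416230519844TP blahblah",
  "5416001130 1 F 492273",
  "5416001140 3 F 492274",
  "5416001145 1 F 492275",
  "4 64001544250619844RA blahblah",
  "5544001125 1 F 492291",
  "5544001130 3 F 492292"
)

hdr_idx <- grep("^4 ", lines)                       # positions of header rows
n_obs   <- diff(c(hdr_idx, length(lines) + 1)) - 1  # observations under each header

# Repeat each header once per observation and paste everything in one call.
pastedLines <- paste(lines[-hdr_idx], rep(lines[hdr_idx], times = n_obs))
pastedLines
```

A header with no observations under it simply gets a count of zero here and contributes no rows.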
Two base R options. Both use readLines to read the raw text data (see the end of this answer).
Option 1:
i <- grepl(pattern = '^4 ', x)
x1 <- strsplit(x[!i], '\\s+')
x2 <- strsplit(x[i], '\\s+')
d1 <- do.call(rbind.data.frame, x1)
d2 <- do.call(rbind.data.frame, x2)
d <- cbind(d1, d2[cumsum(i)[-which(i)],])
names(d) <- paste0('V',1:ncol(d))
This gives:
> d
V1 V2 V3 V4 V5 V6 V7
1 5416001130 1 F 492273 4 64001416230519844TP blahblah
1.1 5416001140 3 F 492274 4 64001416230519844TP blahblah
1.2 5416001145 1 F 492275 4 64001416230519844TP blahblah
1.3 5416001150 19 F 492276 4 64001416230519844TP blahblah
1.4 5416001155 21 F 492277 4 64001416230519844TP blahblah
1.5 5416001160 21 F 492278 4 64001416230519844TP blahblah
1.6 5416001165 13 F 492279 4 64001416230519844TP blahblah
1.7 5416001170 3 F 492280 4 64001416230519844TP blahblah
1.8 5416001180 1 F 492281 4 64001416230519844TP blahblah
2 5544001125 1 F 492291 4 64001544250619844RA blahblah
2.1 5544001130 3 F 492292 4 64001544250619844RA blahblah
2.2 5544001135 4 F 492293 4 64001544250619844RA blahblah
2.3 5544001140 11 F 492294 4 64001544250619844RA blahblah
2.4 5544001145 13 F 492295 4 64001544250619844RA blahblah
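To see how the index `cumsum(i)[-which(i)]` in option 1 lines each observation up with its header, here is a tiny worked example on a made-up logical vector (two headers, with two and one observations respectively):

```r
# TRUE marks a header row, FALSE an observation row.
i <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Running count of headers seen so far: the header number each row belongs to.
header_no <- cumsum(i)
header_no

# Drop the header positions themselves, leaving one header number per observation.
obs_header_no <- header_no[-which(i)]
obs_header_no
```

Indexing the header data frame with `obs_header_no` repeats each header row once per observation. That repeated row indexing is also why the row names in the output above look like 1, 1.1, 1.2 and so on.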
Option 2:
rawlist <- split(x, cumsum(grepl(pattern = '^4 ', x)))
l1 <- lapply(rawlist, function(x) read.table(text = x, skip = 1, header = FALSE))
l2 <- lapply(rawlist, function(x) read.table(text = x, nrows = 1, header = FALSE))
reps <- sapply(l1, nrow)
d1 <- do.call(rbind, l1)
d2 <- do.call(rbind, l2)[rep(1:length(l2), reps),]
d <- cbind(d1, d2)
names(d) <- paste0('V',1:ncol(d))
This gives:
> d
V1 V2 V3 V4 V5 V6 V7
1.1 5416001130 1 FALSE 492273 4 64001416230519844TP blahblah
1.2 5416001140 3 FALSE 492274 4 64001416230519844TP blahblah
1.3 5416001145 1 FALSE 492275 4 64001416230519844TP blahblah
1.4 5416001150 19 FALSE 492276 4 64001416230519844TP blahblah
1.5 5416001155 21 FALSE 492277 4 64001416230519844TP blahblah
1.6 5416001160 21 FALSE 492278 4 64001416230519844TP blahblah
1.7 5416001165 13 FALSE 492279 4 64001416230519844TP blahblah
1.8 5416001170 3 FALSE 492280 4 64001416230519844TP blahblah
1.9 5416001180 1 FALSE 492281 4 64001416230519844TP blahblah
2.1 5544001125 1 FALSE 492291 4 64001544250619844RA blahblah
2.2 5544001130 3 FALSE 492292 4 64001544250619844RA blahblah
2.3 5544001135 4 FALSE 492293 4 64001544250619844RA blahblah
2.4 5544001140 11 FALSE 492294 4 64001544250619844RA blahblah
2.5 5544001145 13 FALSE 492295 4 64001544250619844RA blahblah
Data used:
x <- readLines(textConnection('4 64001416230519844TP blahblah
5416001130 1 F 492273
5416001140 3 F 492274
5416001145 1 F 492275
5416001150 19 F 492276
5416001155 21 F 492277
5416001160 21 F 492278
5416001165 13 F 492279
5416001170 3 F 492280
5416001180 1 F 492281
4 64001544250619844RA blahblah
5544001125 1 F 492291
5544001130 3 F 492292
5544001135 4 F 492293
5544001140 11 F 492294
5544001145 13 F 492295'))
To read your actual data, you could use something like:
x <- readLines('name-of-datafile.txt')
Here is a possible base R solution that tries to be memory-efficient:
rawtext <- "4 64001416230519844TP blahblah
5416001130 1 F 492273
5416001140 3 F 492274
5416001145 1 F 492275
5416001150 19 F 492276
5416001155 21 F 492277
5416001160 21 F 492278
5416001165 13 F 492279
5416001170 3 F 492280
5416001180 1 F 492281
4 64001544250619844RA blahblah
5544001125 1 F 492291
5544001130 3 F 492292
5544001135 4 F 492293
5544001140 11 F 492294
5544001145 13 F 492295"
First read the data once to get the header row numbers. Note that this could also be done outside R with a command-line utility such as grep:
text <- readLines(textConnection(rawtext))
header_rows <- grep("^4", text)
lengths <- diff(c(header_rows, length(text) + 1)) - 1
rm(text)
Then re-read each piece, but only the necessary number of lines:
do.call(rbind, mapply(
function(skip, nrows, ...) data.frame(
read.table(skip = skip, nrows = nrows, ...),
read.table(skip = skip - 1, nrows = 1, ...)
),
MoreArgs = list(text = rawtext),
header_rows,
lengths,
SIMPLIFY = FALSE
))
# V1 V2 V3 V4 V1.1 V2.1 V3.1
# 1 5416001130 1 FALSE 492273 4 64001416230519844TP blahblah
# 2 5416001140 3 FALSE 492274 4 64001416230519844TP blahblah
# 3 5416001145 1 FALSE 492275 4 64001416230519844TP blahblah
# 4 5416001150 19 FALSE 492276 4 64001416230519844TP blahblah
# 5 5416001155 21 FALSE 492277 4 64001416230519844TP blahblah
# 6 5416001160 21 FALSE 492278 4 64001416230519844TP blahblah
# 7 5416001165 13 FALSE 492279 4 64001416230519844TP blahblah
# 8 5416001170 3 FALSE 492280 4 64001416230519844TP blahblah
# 9 5416001180 1 FALSE 492281 4 64001416230519844TP blahblah
# 10 5544001125 1 FALSE 492291 4 64001544250619844RA blahblah
# 11 5544001130 3 FALSE 492292 4 64001544250619844RA blahblah
# 12 5544001135 4 FALSE 492293 4 64001544250619844RA blahblah
# 13 5544001140 11 FALSE 492294 4 64001544250619844RA blahblah
# 14 5544001145 13 FALSE 492295 4 64001544250619844RA blahblah