How can I read an odd file type that uses multiple delimiters into R?
My source file comes from an old test machine that spits out "*.ctf" files. When I open one with readLines() I get a long vector whose sections are preceded by "[header_name]", and within each section (between headers) 4 columns are separated by tabs ("\t").
Ideally, I would like to split each section into its own list/data frame with 4 columns.
Here is an example of the vector after reading the file into R with readLines() (note that I skip from element [5] to element [21]):
vector
[1] "[HEADER]" "Created by Sigma-1 ICON Version 4.5.3; Copyright 2005, GEOTAC"
[3] "Project:\tACC#1210004 \tLoad Frame Name:\tLoad Frame" "Date:\t1/1/2002 \tTime:\t12:39:01 AM "
[5] "Boring:\tBoring2\tSample:\tSample7"
...
[21] "" "[STEP 1]\t850\t0"
[23] "Time\tExternal Load Cell\tDCDT\tPlaten Position" "1/1/2002 12:40:52 AM\t-2.31623424260761E-04 \t 3.45233241577262 \t 3150948 "
[25] "1/1/2002 12:41:07 AM\t-3.22715023139608E-04 \t 3.45440429846349 \t 3157103 " "1/1/2002 12:41:22 AM\t-3.2964900303341E-04 \t 3.4553244755898 \t 3158611 "
Ideally, reading the file would produce several lists/data frames, one per [header] section, named after that header and split into 4 columns on "\t", with the header row on top. For example, [STEP 1] looks like this in Excel, and a data frame along those lines would be perfect.
I was hoping something like read.table could handle this with a tab separator, but it throws an error because the columns of the different sections overlap one another.
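For reference, this is roughly the kind of call I tried (just a sketch; "my.ctf" is a placeholder file name). It fails because the sections do not share one tab-separated layout:
raw <- read.table("my.ctf", sep = "\t", header = FALSE)  # errors on the mixed section layouts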
Edit in response to comments:
> dput(head(vector, 40))
c("[HEADER]", "Created by Sigma-1 ICON Version 4.5.3; Copyright 2005, GEOTAC",
"Project:\tACC#1210004 \tLoad Frame Name:\tLoad Frame", "Date:\t1/1/2002 \tTime:\t12:39:01 AM ",
"Boring:\tBoring2\tSample:\tSample7", "Specimen:\tSpecimen1\tDepth (ft):\t 21 ",
"Diameter (inch):\t 2.5025 \tHeight (inch):\t 1.00825 ", "Comments:\tTare J 217.028 paper .311 .463 wet weight 379.024g",
"", "[SENSORS]", "Name\tExternal Load Cell\tDCDT\tLoad Frame Encoder",
"ID\t227396\tLP183\tN/A", "Module\tLoad Frame ADIO\tLoad Frame ADIO\tN/A",
"Channel\t 1 \t 2 \tN/A", "Unit\tlbs\tinch\tinch", "Cal. Factor\t-796107.1205 \t 3.02704684 \t 3940000 ",
"Excitation\t 9.98139953613281 \t 9.98139953613281 \tN/A", "Zero\t 3.38862647549831E-05 \t 3.10994816131097 \tN/A",
"Min. Reading\t-1000 \t-0.05 \t0.0", "Max. Reading\t 2000 \t 3 \t 1.5 ",
"", "[STEP 1]\t850\t0", "Time \tExternal Load Cell\tDCDT\tPlaten Position",
"1/1/2002 12:40:52 AM\t-2.31623424260761E-04 \t 3.45233241577262 \t 3150948 ",
"1/1/2002 12:41:07 AM\t-3.22715023139608E-04 \t 3.45440429846349 \t 3157103 ",
"1/1/2002 12:41:22 AM\t-3.2964900303341E-04 \t 3.4553244755898 \t 3158611 ",
"1/1/2002 12:41:38 AM\t-3.35823094719672E-04 \t 3.45592288755324 \t 3159627 ",
"1/1/2002 12:41:53 AM\t-3.34113346252707E-04 \t 3.45707221846715 \t 3160244 ",
"1/1/2002 12:42:24 AM\t-3.25707082956796E-04 \t 3.45724794261514 \t 3160806 ",
"1/1/2002 12:42:54 AM\t-3.34350811317563E-04 \t 3.45749134430662 \t 3161526 ",
"1/1/2002 12:43:24 AM\t-3.32652936103841E-04 \t 3.4578036108669 \t 3161849 ",
"1/1/2002 12:43:54 AM\t-3.31216272461461E-04 \t 3.45799833222009 \t 3162033 ",
"1/1/2002 12:44:54 AM\t-3.2508967378817E-04 \t 3.45834978051607 \t 3162380 ",
"1/1/2002 12:45:54 AM\t-3.28473550962372E-04 \t 3.45827497902064 \t 3162464 ",
"1/1/2002 12:46:54 AM\t-3.32878527915454E-04 \t 3.4585171933868 \t 3162704 ",
"1/1/2002 12:47:54 AM\t-3.23534277613362E-04 \t 3.45914291383269 \t 3161933 ",
"1/1/2002 12:49:54 AM\t-3.38494576699304E-04 \t 3.45977932020651 \t 3162452 ",
"1/1/2002 12:50:56 AM\t-3.31038173662819E-04 \t 3.45979950473702 \t 3159002 ",
"", "[STEP 2]\t1700\t0")
If your data follows the pattern [STEP1][COLUMNNAMES][DATA][STEP2][COLUMNNAMES][DATA]... I think this will work.
start <- grep('^Time', vector)            # first row of each data block
end   <- grep('\\[STEP', vector)[-1] - 2  # two lines above the next [STEP ...] header
result <- do.call(rbind, Map(function(x, y)
  read.csv(text = paste0(vector[x:y], collapse = '\n'), sep = '\t'),
  start, end))
result
The logic here is that we assume the first column name is 'Time', and that the data runs from there until the next STEP header is found.
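If you would rather keep one data frame per step (named by its [STEP n] line) instead of rbind-ing everything, here is a minimal variant of the same idea. It assumes, as in your dput, that each 'Time' row sits directly below its [STEP n] line and that the last block runs to the end of the vector:
steps <- grep('\\[STEP', vector)
start <- grep('^Time', vector)
end   <- c(steps[-1] - 2, length(vector))[seq_along(start)]
blocks <- Map(function(x, y)
  read.csv(text = paste0(vector[x:y], collapse = '\n'), sep = '\t'),
  start, end)
## name each block after the [STEP n] line sitting right above its 'Time' row
names(blocks) <- sub('\t.*', '', vector[start - 1])
blocks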
I keep running into similar problems with social-media monitoring output. Here I assume rl_text is the vector of lines you pasted in your dput. I load the tidyverse and splitstackshape packages.
library(tidyverse)
library(splitstackshape)

df_raw <-
  rl_text %>%
  as_tibble() %>%
  rowid_to_column(var = "line_id") %>%
  splitstackshape::cSplit("value", sep = "\t", direction = "wide") %>%
  mutate_if(.predicate = is.factor, as.character) %>%
  mutate(is_header = grepl("^\\[.*\\]$", value_1),  ## a header row is a string that begins and ends with straight brackets []
         header = ifelse(is_header, value_1, NA)) %>%
  fill(header)
ls_headers <- unique(df_raw$header)

## Function to get a df out of the rows under a common header.
unnest_dfs <- function(headers, df = df_raw) {
  df_filtered <- filter(df, header == headers) %>%
    select(-c(line_id, is_header, header))
  df_filtered
}

list_with.dfs <- map(ls_headers, unnest_dfs)
list_with.dfs
[[1]]
# A tibble: 9 x 4
value_1 value_2 value_3 value_4
<chr> <chr> <chr> <chr>
1 [HEADER] NA NA NA
2 Created by Sigma-1 ICON Version 4.5.3; Copyright 2005, GE~ NA NA NA
3 Project: ACC#1210004 Load Frame Nam~ Load Frame
...
[[2]]
# A tibble: 12 x 4
value_1 value_2 value_3 value_4
<chr> <chr> <chr> <chr>
1 [SENSORS] NA NA NA
2 Name External Load Cell DCDT Load Frame Encoder
3 ID 227396 LP183 N/A
...
[[3]]
# A tibble: 18 x 4
value_1 value_2 value_3 value_4
<chr> <chr> <chr> <chr>
1 [STEP 1] 850 0 NA
2 Time External Load Cell DCDT Platen Position
3 1/1/2002 12:40:52 AM -2.31623424260761E-04 3.45233241577262 3150948
...
[[4]]
# A tibble: 1 x 4
value_1 value_2 value_3 value_4
<chr> <chr> <chr> <chr>
1 [STEP 2] 1700 0 NA
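As a possible follow-up (a sketch, not part of the answer above): if you then want a [STEP n] piece with proper column names and numeric measurement columns, you could do something like the following, assuming list_with.dfs from above, with the [STEP 1] block as its third element and the column names in that block's second row.
step1 <- list_with.dfs[[3]]                        ## the [STEP 1] block shown above
names(step1) <- unlist(step1[2, ])                 ## second row holds the column names
step1 <- step1[-(1:2), ]                           ## drop the [STEP 1] line and the name row
step1 <- mutate(step1, across(-Time, as.numeric))  ## measurement columns to numeric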