R: Reading specific columns from txt files with slightly different column headers (differing spaces) and binding them?
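I have many txt files that contain the same kind of numeric data in columns separated by ;. Some files have spaces in their column headers and some do not (they were created by different people). Some also have extra columns that I don't want.
For example, one file might have a header like:
ASomeName; BSomeName; C(someName%)
while another file's header might be:
A Some Name; B Some Name; C(someName%); D some name
How can I strip the spaces out of the names before calling a "read" command?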
#These are the files I have
filenames<-list.files(pattern = "*.txt",recursive = TRUE,full.names = TRUE)%>%as_tibble()
#These are the columns I would like:
colSelect=c("Date","Time","Timestamp" ,"PM2_5(ug/m3)","PM10(ug/m3)","PM01(ug/m3)","Temperature(C)", "Humidity(%RH)", "CO2(ppm)")
#This is how I read them if they have the same columns
ldf <- vroom::vroom(filenames, col_select = colSelect,delim=";",id = "sensor" )%>%janitor::clean_names()
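Clean Headers script
I wrote a destructive script that reads each file in full, cleans the spaces out of the header, deletes the file, and re-writes it under the same name (vroom sometimes complains that it cannot open X thousand files). It is not an efficient way of doing things.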
cleanHeaders<-function(filename){
d<-vroom::vroom(filename,delim=";")%>%janitor::clean_names()
#print(head(d))
if (file.exists(filename)) {
#Delete file if it exists
file.remove(filename)
}
vroom::vroom_write(d,filename,delim = ";")
}
lapply(filenames,cleanHeaders)
fread's select argument accepts integer indexes. If the columns you want are always in the same positions, your job is done.
colIndexes = c(1,3,4,7,9,18,21)
data = lapply(filenames, fread, select = colIndexes)
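I suppose vroom has this feature too, but since you are already selecting the columns you want, I don't think lazily evaluating your character columns would help at all, so I suggest sticking with data.table.
For a more robust solution, though, given that you have no control over the structure of the tables: you can read a single row from each file, capture and clean the column names, and then match them against a cleaned version of your colSelect vector.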
library(data.table)
library(janitor)
library(purrr)
filenames <- list.files(pattern = "*.txt",
                        recursive = TRUE,
                        full.names = TRUE)
# read the first row of data to capture and clean the column names
clean_col_names <- function(filename){
  colnames(janitor::clean_names(fread(filename, nrows = 1)))
}
clean_column_names <- map(.x = filenames,
                          .f = clean_col_names)
# clean the colSelect vector
colSelect <- janitor::make_clean_names(c("Date",
                                         "Time",
                                         "Timestamp",
                                         "PM2_5(ug/m3)",
                                         "PM10(ug/m3)",
                                         "PM01(ug/m3)",
                                         "Temperature(C)",
                                         "Humidity(%RH)",
                                         "CO2(ppm)"))
# match each set of column names against the clean colSelect
# (match() returns NA for any wanted column that a file does not contain)
select_indices <- map(.x = clean_column_names,
                      .f = function(cols) match(colSelect, cols))
# use map2 to read only the matched indexes for each file
data <- purrr::map2(.x = filenames,
                    .y = select_indices,
                    ~fread(input = .x, select = .y))
(purrr here could easily be replaced by a traditional lapply; I chose purrr because its formula notation is cleaner.)
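The title also asks about binding, which the code above stops short of. What follows is only a rough sketch of one way to finish the job, not a tested recipe for your files. It assumes that a wanted column may be missing from some files (so NA matches are dropped), and that you want a sensor id column like the one vroom's id argument produced. The helper read_clean, the vector wanted, and the two demo files are made up for illustration.

```r
library(data.table)
library(janitor)

# Two small demo files with differently spaced headers (made-up data)
demo_dir <- tempfile("sensors")
dir.create(demo_dir)
writeLines(c("Date;Time;CO2(ppm)", "2021-01-01;10:00;400"),
           file.path(demo_dir, "a.txt"))
writeLines(c("Date; Time; CO2 (ppm); Extra col", "2021-01-02;11:00;410;x"),
           file.path(demo_dir, "b.txt"))
filenames <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Cleaned names of the columns to keep (hypothetical subset of colSelect)
wanted <- make_clean_names(c("Date", "Time", "CO2(ppm)"))

read_clean <- function(filename) {
  # nrows = 0 reads only the header and returns an empty table with its names
  header <- names(fread(filename, sep = ";", header = TRUE, nrows = 0))
  idx <- match(wanted, make_clean_names(header))
  idx <- idx[!is.na(idx)]  # skip wanted columns this file does not have
  dt <- fread(filename, sep = ";", header = TRUE, select = idx)
  # rename to the cleaned names so every file ends up with the same names
  setnames(dt, make_clean_names(names(dt)))
  dt
}

# Name the list by file so idcol records which file each row came from;
# fill = TRUE pads columns that are absent from some files with NA
ldf <- rbindlist(setNames(lapply(filenames, read_clean), filenames),
                 idcol = "sensor", use.names = TRUE, fill = TRUE)
```

Renaming inside read_clean rather than after binding matters here: rbindlist with use.names = TRUE lines columns up by name, so "CO2(ppm)" and "CO2 (ppm)" would otherwise land in separate columns.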