Import data with format
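I have a question about importing multiple data files (excel or csv).
I know that if I want to read several excel files (with the same column names) at once, the code is: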
library(readxl)
library(data.table)  # provides rbindlist()
file.list <- dir(path = "/path", pattern = "\\.xlsx$", full.names = TRUE)
df.list <- lapply(file.list, read_excel)
data <- rbindlist(df.list)
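So I read them all at once and combine them into one data set.
However, take one of my files as an example.
In excel, row 1 holds a title and row 2 is the header, so the observations start at row 3.
In addition, the first file looks like this in EXCEL: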
TITLE~~~~~
col1 col2 col3
A 1 3.59283E+14
B 2 3.59258E+14
C 3 3.59286E+14
REFUND
A -1 3.59286E+14
However, col3 is defined as a number in excel. The actual values look like this:
TITLE~~~~~
col1 col2 col3
A 1 359283060959987
B 2 359258069826064
C 3 359286062903911
REFUND
A -1 359283060959987
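Row 6 contains REFUND. Across my multiple files, I don't know which row contains REFUND. I want to read my observations without these rows. How can I do that?
Also, col3 is really character, but in excel it looks numeric.
How can I define it as character when importing into R, so that it does not show exponent notation after the import?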
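I don't think the data can be read directly without the REFUND rows, at least not with the read_excel function.
But I am quite new to R, so I may be wrong.
That said, the first thing that came to mind was to build your own function. The one below seems to work.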
library(readxl)
library(data.table)
file.list <- dir(path = ".", pattern = "\\.xlsx$", full.names = TRUE)
my_read_data <- function(x) {  # x: character vector of file paths
  df.list <- lapply(x, function(f) {
    read_excel(path = f, skip = 1, col_names = TRUE,
               col_types = c("text", "numeric", "text"))
  })
  # skip -> skip the line with the title
  # col_names -> use the first row read as column names, i.e., col1, col2 and col3
  # col_types -> vector containing one entry per column indicating the type of data
  my.data <- rbindlist(df.list)
  my.data.clean <- my.data[my.data$col1 != "REFUND", ]  # select only rows without "REFUND"
  return(my.data.clean)
}
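The filtering step inside my_read_data can be tried on its own, without any Excel files. The following is a minimal sketch, assuming a small in-memory table (`demo`, a made-up stand-in for what `rbindlist` returns) in which the REFUND row has the marker in col1 and missing values in the other columns:

```r
library(data.table)

# Hypothetical stand-in for the combined table returned by rbindlist();
# the REFUND row carries the marker in col1 and NA in the other columns.
demo <- data.table(
  col1 = c("A", "B", "C", "REFUND", "A"),
  col2 = c(1, 2, 3, NA, -1),
  col3 = c("359283060959987", "359258069826064", "359286062903911",
           NA, "359283060959987")
)

# Same filter as in my_read_data(): keep every row whose col1 is not "REFUND".
demo.clean <- demo[col1 != "REFUND"]

nrow(demo.clean)         # number of rows left after dropping the REFUND row
unique(demo.clean$col1)  # remaining values of col1
```

The filter looks at the value of col1 rather than at a row position, so it does not matter on which row REFUND appears in each file.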
To run the function, I copied your excel example four times, changing the position of the REFUND row. The result I got is as follows.
the.data<-my_read_data(file.list)
>the.data
col1 col2 col3
1: A 1 359283060959987
2: B 2 359258069826064
3: C 3 359286062903911
4: A -1 359283060959987
5: A 1 359283060959987
6: B 2 359258069826064
7: C 3 359286062903911
8: A -1 359283060959987
9: A 1 359283060959987
10: B 2 359258069826064
11: C 3 359286062903911
12: A -1 359283060959987
13: A 1 359283060959987
14: B 2 359258069826064
15: C 3 359286062903911
16: A -1 359283060959987
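EDIT - a function that takes the columns to be converted to character type
Regarding your comment, maybe you could consider using this function instead: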
my_read_data2 <- function(x, character_col = NULL) {
  # x -> character vector of file paths
  # character_col -> column(s) to be changed to character; can be more than one
  df.list <- lapply(x, function(f) {
    read_excel(path = f, skip = 1, col_names = TRUE)
  })
  my.data <- rbindlist(df.list)
  my.data.clean <- my.data[my.data$col1 != "REFUND", ]  # select only rows without "REFUND"
  # changing the columns selected by character_col to character
  # since the result from the step above is a data table,
  # access to elements is different from a data frame
  if (!is.null(character_col)) {  # this allows you to use the function with only
                                  # the default results from read_excel
    my.data.clean[, eval(character_col) := lapply(.SD, as.character),
                  .SDcols = character_col]
  }
  # eval -> you need to evaluate the argument you pass to the function,
  #   otherwise you'll end up with an additional column named character_col
  #   that will be a list of all the columns you include in .SDcols
  # .SD -> is the subset of the data table; in this case
  #   .SDcols specifies the columns that are included in .SD
  return(my.data.clean[])  # don't forget the [] here, to avoid the odd
                           # behaviour when printing the resulting data table
                           # (see link at the end)
}
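The column conversion can also be tried in isolation. This is only a sketch on a made-up in-memory table (`demo` and `cols` are illustrative names, not part of the function above), and it uses the parenthesized form `(cols) :=`, which has the same effect here as wrapping the argument in `eval()`:

```r
library(data.table)

# Hypothetical table with col3 held as numeric, as a default read would guess.
demo <- data.table(
  col1 = c("A", "B", "C"),
  col2 = c(1, 2, 3),
  col3 = c(359283060959987, 359258069826064, 359286062903911)
)

cols <- "col3"  # could also be c("col2", "col3")

# Wrapping cols in parentheses makes data.table use the names stored in it,
# instead of creating a new column literally called "cols".
demo[, (cols) := lapply(.SD, as.character), .SDcols = cols]

str(demo)
```

The assignment modifies `demo` by reference, so no `<-` is needed on the result.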
Examples:
the.data<-my_read_data2(file.list)
str(the.data)
>str(the.data)
Classes ‘data.table’ and 'data.frame': 16 obs. of 3 variables:
$ col1: chr "A" "B" "C" "A" ...
$ col2: num 1 2 3 -1 1 2 3 -1 1 2 ...
$ col3: num 3.59e+14 3.59e+14 3.59e+14 3.59e+14 3.59e+14 ...
- attr(*, ".internal.selfref")=<externalptr>
the.data1<-my_read_data2(file.list,"col3")
str(the.data1)
> str(the.data1)
Classes ‘data.table’ and 'data.frame': 16 obs. of 3 variables:
$ col1: chr "A" "B" "C" "A" ...
$ col2: num 1 2 3 -1 1 2 3 -1 1 2 ...
$ col3: chr "359283060959987" "359258069826064" "359286062903911" "359283060959987" ...
- attr(*, ".internal.selfref")=<externalptr>
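The exponent notation in the first `str()` output is only how R prints a large double; the stored value still holds all 15 digits, which is why `as.character` recovers them. A small sketch of this follows, with one caveat that is my own addition and not from the question: doubles represent integers exactly only up to 2^53, so identifiers longer than about 15 digits would be safer read as text from the start.

```r
x <- 359283060959987           # 15-digit value from the example

print(x)                       # printed with R's default number of significant digits
format(x, scientific = FALSE)  # full digits, returned as a character string
as.character(x)                # also a character string with all 15 digits

# Beyond 2^53, neighbouring integers can no longer be told apart as doubles.
2^53 + 1 == 2^53
```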
You can also pass more than one column:
the.data2<-my_read_data2(file.list,c("col2","col3"))
the.data3<-my_read_data2(file.list,c(2,3))
Hope this helps.