Reading in multiple Excel files, adding a column, then binding
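I have a series of Excel files that I want to read into R, add a date column to based on the filename, and then bind together.
The files follow the naming convention User_Info_Jan, User_Info_Feb, User_Info_Mar. The month appears only in the filename; it is not mentioned in the files themselves. An example of the User_Info_Jan file: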
ID Name
ABC Joe Smith
DEF Henry Cooper
ZCS Kelly Ma
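Is there a way to read in the files using a pattern in the filename (pattern = User_Info_), and then add a column called "Month" indicating each file's month before binding them together?
Example data frame after adding the Month column: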
ID Name Month
ABC Joe Smith January
DEF Henry Cooper January
ZCS Kelly Ma January
Example data frame after binding:
ID Name Usage Month
ABC Joe Smith January
DEF Henry Cooper January
ZCS Kelly Ma January
KFY Lisa Schwartz February
LFG Alex Shah March
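I would use the map() function from the purrr library to solve this.
Since we are reading in files, there is no reproducible example here; the closest code example I have is as follows: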
library(tidyverse)  # loads purrr, stringr, readr, tibble, dplyr and tidyr

# Get all the filenames (I assume these contain the month in your case)
GravfilesMap <- list.files("GravityModel/MapOut", full.names = TRUE)
GravMap <-
  # Use a regex to pull the identifier out of each filename; for month names I would try
  # "_([a-zA-Z]+)\\.xlsx$" passed to str_match() (column 2 of the result is the capture group)
  tibble(id = str_match(GravfilesMap, "(\\d+)\\.csv$")[, 2]) %>%
  # For each filename, read the data into its own data frame (each row then holds an id and a nested data frame)
  # I have used read_csv() here; you would use something like read_xls()
  # The files are read in the same order as the ids, because both come from GravfilesMap
  mutate(file_contents = map(GravfilesMap, ~ read_csv(.x, col_names = FALSE)))
# Unnest the data frames to get the form that was requested
GravMap <- GravMap %>% unnest(cols = file_contents)
For details on a similar approach, see https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/
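I'll demonstrate with fake filenames, but I suggest you run the real commands, which are shown commented out and have the same structure. I'm assuming .xlsx for "Excel files", but this works equally well for .csv (just update the pattern).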
# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
# month
# 1 Jan
# 2 Feb
# 3 Mar
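At this point, we have the month name extracted from each filename. I find strcapture (in base R) to be better than gsub, since the latter returns the entire string when there is no match. Another option in base R is regmatches(files, gregexpr(...)), but that seems a little more complicated than needed here. Another alternative is stringr::str_extract, which may be more intuitive if you are already using stringr and/or other tidyverse packages.
From here, we can iterate over the files to read them in.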
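To make the difference between gsub and strcapture concrete, here is a minimal base R sketch. The third filename, ./notes.txt, is a hypothetical non-matching file added only for illustration, and the gsub pattern is padded with a leading .* so that the replacement keeps just the capture group.

```r
# Two matching filenames plus one hypothetical file that does not fit the convention
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./notes.txt")

# gsub() returns the input unchanged when the pattern does not match,
# so the stray filename passes through looking like a "month"
via_gsub <- gsub(".*User_Info_(.*)\\.xlsx$", "\\1", files)

# strcapture() returns NA for inputs that do not match, which is easy to detect
via_strcapture <- strcapture("User_Info_(.*)\\.xlsx$", files,
                             data.frame(month = character()))

via_gsub
via_strcapture$month
is.na(via_strcapture$month)  # flags the file that should be skipped or investigated
```

With strcapture, a quick is.na() check is enough to catch files that do not follow the naming convention before they are read in.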
# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 Jan
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Jan
# $Feb
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Feb
# Pontiac Firebird 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2 Feb
# $Mar
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Mar
# Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Mar
Combining that list of frames into a single frame is straightforward:
do.call(rbind, out)
# mpg cyl disp hp drat wt qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 Jan
# Jan.Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Jan
# Feb.Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Feb
# Feb.Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 Feb
# Mar.Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 Mar
# Mar.Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Mar
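The desired output in the question shows full month names (January, February, March), while the filenames only carry three-letter abbreviations. One possible way to convert them, assuming the abbreviations are the standard English ones, is to look them up in base R's built-in month.abb and month.name constants. The abbrev vector here is just a stand-in for the extracted month column.

```r
# Stand-in for the abbreviations extracted from the filenames
abbrev <- c("Jan", "Feb", "Mar")

# match() finds each abbreviation's position in month.abb (1 to 12);
# indexing month.name with that position gives the full English name
full <- month.name[match(abbrev, month.abb)]
full

# An abbreviation that is not in month.abb yields NA rather than an error
month.name[match("Sept", month.abb)]
```

Applied to a bound data frame, this would be something like df$month <- month.name[match(df$month, month.abb)], where df is whatever name the combined frame was assigned to.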
As an alternative to all of this, you can use data.table::rbindlist or dplyr::bind_rows and have them assign the "id" column directly:
# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)
data.table::rbindlist(out, idcol = "month")
# month mpg cyl disp hp drat wt qsec vs am gear carb
# <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: Jan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# 2: Jan 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# 3: Feb 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# 4: Feb 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# 5: Mar 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# 6: Mar 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
dplyr::bind_rows(out, .id = "month")
# month mpg cyl disp hp drat wt qsec vs am gear carb
# Chrysler Imperial Jan 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
# Hornet Sportabout Jan 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
# Mazda RX4 Feb 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
# Pontiac Firebird Feb 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# Merc 280 Mar 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# Hornet 4 Drive Mar 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
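The latter two work because, when I called Map earlier, the first vector passed along to the inner function (monthnames$month) is a character vector and is therefore used as the names of the output list, which is why you see $Jan etc. as the elements of the returned list. When idcol= / .id= is used, both rbindlist and bind_rows take those names as the "id" column. (If no names are actually present, both functions number the elements instead.)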
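The naming behaviour of Map can be checked in isolation. The sketch below uses base R only; the short filenames and the nchar() call are placeholders for real files and a real reader function.

```r
fnames <- c("a.xlsx", "bb.xlsx")  # placeholder filenames

# First vector is an unnamed character vector: its values become the list names
by_char <- Map(function(mn, fn) nchar(fn), c("Jan", "Feb"), fnames)
names(by_char)

# First vector is an unnamed integer vector: the result has no names at all
by_int <- Map(function(i, fn) nchar(fn), 1:2, fnames)
names(by_int)

# Setting the names explicitly avoids relying on argument order
explicit <- setNames(lapply(fnames, nchar), c("Jan", "Feb"))
names(explicit)
```

If the order of the arguments to Map were swapped, the filenames would become the list names and would end up in the "id" column instead, so naming the list explicitly with setNames() can be the safer choice.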
You can try the purrr package like this:
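library(openxlsx)
library(purrr)
library(dplyr)  # provides mutate() and %>%

files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
months <- c("Jan", "Feb", "Mar")
# Read each file, tag its rows with the matching month, and row-bind the results
map2_dfr(files, months, function(x, y) read.xlsx(x) %>% mutate(Month = y))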
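For anyone who wants to try the whole read, tag, and bind workflow without Excel files at hand, here is a self-contained base R sketch. It is only an illustration under stated assumptions: it writes small CSV stand-ins to a temporary directory and reads them with read.csv(), where the real task would use an Excel reader such as readxl::read_excel() or openxlsx::read.xlsx(). The sample rows come from the question; the object names are made up for this example.

```r
# Write small CSV stand-ins (not real Excel files) into a temporary directory
tmp_dir <- tempfile("user_info_")
dir.create(tmp_dir)
samples <- list(
  Jan = data.frame(ID = c("ABC", "DEF"), Name = c("Joe Smith", "Henry Cooper")),
  Feb = data.frame(ID = "KFY", Name = "Lisa Schwartz"),
  Mar = data.frame(ID = "LFG", Name = "Alex Shah")
)
for (m in names(samples)) {
  write.csv(samples[[m]], file.path(tmp_dir, paste0("User_Info_", m, ".csv")),
            row.names = FALSE)
}

# Find the files by pattern and extract the month from each filename
files <- list.files(tmp_dir, pattern = "^User_Info_.*\\.csv$", full.names = TRUE)
months <- strcapture("User_Info_(.*)\\.csv$", basename(files),
                     data.frame(month = character()))$month

# Read each file, add its month as a column, then bind everything together
pieces <- Map(function(mn, fn) transform(read.csv(fn, stringsAsFactors = FALSE), Month = mn),
              months, files)
combined <- do.call(rbind, unname(pieces))

# list.files() sorts alphabetically (Feb, Jan, Mar), so restore calendar order
combined <- combined[order(match(combined$Month, month.abb)), ]
rownames(combined) <- NULL
combined
```

The reordering step matters only when row order is important; the Month column is correct either way because each month is extracted from the same filename that is being read.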