Reading in multiple Excel files, adding a column, then binding

I have a series of Excel files that I want to read into R, add a date column to based on the file name, and then bind together.

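The files follow the naming convention User_Info_Jan, User_Info_Feb, User_Info_Mar. The month appears only in the file name; it is not mentioned anywhere in the file itself. An example of the User_Info_Jan file: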

ID   Name
ABC  Joe Smith
DEF  Henry Cooper 
ZCS  Kelly Ma

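Is there a way to read in the files using a pattern in the file name (pattern = User_Info_) and then, before binding them together, add a column called "Month" that indicates the month of each file?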

Example data frame after adding the Month column:

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January

Example data frame after binding:

ID   Name           Usage Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January
KFY  Lisa Schwartz  February
LFG  Alex Shah      March

I would use the map() function from the purrr library to solve this.

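Because we are reading in files, there is no reproducible example to share; the most recent code I wrote for this kind of task is as follows: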

library(tidyverse)  # provides %>%, str_match(), tibble(), mutate(), map(), read_csv(), unnest()

# Get all the file names (I assume these contain the month in your case)
GravfilesMap <- list.files("GravityModel/MapOut", full.names = TRUE)

GravMap <-
  # Use a regex to pull the identifier out of each file name. For the month, I would try
  # "_([a-zA-Z]+)\\.xlsx$" passed to str_match(); column 2 of its result holds the captured group
  tibble(id = str_match(GravfilesMap, "(\\d+)\\.csv$")[, 2]) %>%
  # For each file name, read the data into its own data frame
  # (each row then holds an id, or a month name in your case, plus a nested data frame)
  # I have used read_csv() here; you would use something like readxl::read_excel()
  # The files are read in the order returned by list.files(), so they line up with the ids
  mutate(file_contents = map(GravfilesMap, ~ read_csv(.x, col_names = FALSE)))

# Unnest the data frames to get the form that was requested
GravMap <- GravMap %>% unnest(file_contents)

For details on a similar approach, see https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/

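I'll demonstrate with fake file names, but I suggest you run the real commands, which I've left commented out and which use the same structure. I'm assuming .xlsx for "Excel files", but this works equally well for .csv (just update the pattern).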

# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
#   month
# 1   Jan
# 2   Feb
# 3   Mar

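At this point we have extracted the month name from each file name. I find strcapture (in base R) better than gsub, because the latter returns the entire string if there is no match. Another base R option is regmatches(files, gregexpr(...)), but that seems a bit more complicated than is needed here. Yet another option is stringr::str_extract, which may be more intuitive if you are already using stringr and/or other tidyverse packages.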
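
To make the difference between the two base R approaches concrete, here is a minimal sketch. The second file name (README.txt) is a made-up non-matching entry added only for illustration, and I anchor the sub() pattern with a leading .* so that the "./" prefix is dropped as well.

```r
# One matching file name and one hypothetical non-matching one
files <- c("./User_Info_Jan.xlsx", "./README.txt")

# sub()/gsub() leave the input untouched when the pattern does not match
with_sub <- sub(".*User_Info_(.*)\\.xlsx$", "\\1", files)
with_sub
# [1] "Jan"          "./README.txt"

# strcapture() returns NA for the non-matching entry instead
captured <- strcapture("User_Info_(.*)\\.xlsx$", files,
                       data.frame(month = character()))
captured$month
# [1] "Jan" NA
```

With strcapture, a stray file shows up as NA in the month column, which is easier to spot than a full path silently passed through.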

From here, we can iterate over the files to read them in.

# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4   Jan
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   Jan
# $Feb
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Mazda RX4        21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   Feb
# Pontiac Firebird 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2   Feb
# $Mar
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

Combining that list of frames into a single frame is straightforward:

do.call(rbind, out)
#                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   Jan
# Jan.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Jan
# Feb.Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   Feb
# Feb.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2   Feb
# Mar.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Mar.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar
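
For anyone who wants to run the whole pipeline on real files without installing anything, here is a rough end-to-end sketch in base R. It is an assumption-laden stand-in: it writes small CSV files (not Excel files) to a temporary directory, using the rows from the question, and reads them with read.csv. For .xlsx files you would swap in readxl::read_excel and change the pattern accordingly.

```r
# Hypothetical setup: write three small CSV files to a temporary directory
dir <- tempfile("demo_")
dir.create(dir)
write.csv(data.frame(ID = c("ABC", "DEF", "ZCS"),
                     Name = c("Joe Smith", "Henry Cooper", "Kelly Ma")),
          file.path(dir, "User_Info_Jan.csv"), row.names = FALSE)
write.csv(data.frame(ID = "KFY", Name = "Lisa Schwartz"),
          file.path(dir, "User_Info_Feb.csv"), row.names = FALSE)
write.csv(data.frame(ID = "LFG", Name = "Alex Shah"),
          file.path(dir, "User_Info_Mar.csv"), row.names = FALSE)

# Find the files by pattern and pull the month out of each file name
files <- list.files(dir, pattern = "^User_Info_.*\\.csv$", full.names = TRUE)
monthnames <- strcapture("User_Info_(.*)\\.csv$", files,
                         data.frame(month = character()))

# Read each file, add its month as a column, then bind everything together
out <- Map(function(mn, fn) transform(read.csv(fn), Month = mn),
           monthnames$month, files)
combined <- do.call(rbind, out)
rownames(combined) <- NULL  # drop the "Jan.1"-style row names added by rbind
combined
```

Note that list.files returns the names in alphabetical order, so the rows come out grouped as Feb, Jan, Mar rather than in calendar order.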

As an alternative to all of this, you can use data.table::rbindlist or dplyr::bind_rows and assign the "id" column directly:

# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)

data.table::rbindlist(out, idcol = "month")
#     month   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:    Jan  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
# 2:    Jan  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
# 3:    Feb  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
# 4:    Feb  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
# 5:    Mar  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# 6:    Mar  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1

dplyr::bind_rows(out, .id = "month")
#                   month  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Chrysler Imperial   Jan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Hornet Sportabout   Jan 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Mazda RX4           Feb 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Pontiac Firebird    Feb 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Merc 280            Mar 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Hornet 4 Drive      Mar 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

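The latter two work because, when I called Map earlier, the first argument passed to the inner function (monthnames$month) was used as the names of the output list, which is why you see $Jan and so on as the elements of the returned list. When idcol=/.id= is used, both rbindlist and bind_rows use those names for the "id" column. (If no names are actually present, both functions simply count along instead.)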
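
The naming behaviour can be checked in isolation with base R alone. This is only a small sketch with placeholder file names; the inner function returns its second argument just so there is something to store.

```r
# Map() names its result after the first vector when that vector is character
months <- c("Jan", "Feb", "Mar")
files <- c("a.xlsx", "b.xlsx", "c.xlsx")  # placeholder file names
out <- Map(function(mn, fn) fn, months, files)
names(out)
# [1] "Jan" "Feb" "Mar"

# Stripping the names leaves nothing for idcol=/.id= to pick up
out_unnamed <- unname(out)
names(out_unnamed)
# NULL
```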

You can try the purrr package like this:

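files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
months <- c("Jan", "Feb", "Mar")

library(openxlsx)
library(purrr)
library(dplyr)  # needed for mutate()
# Read each file, add its month as a column, and row-bind the results
map2_dfr(files, months, function(x, y) read.xlsx(x) %>% mutate(Month = y))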
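
The question's expected output shows full month names (January) while the file names only carry abbreviations (Jan). One way to bridge that, assuming the abbreviations are the standard three-letter English ones, is to look them up in the built-in constants month.abb and month.name:

```r
months <- c("Jan", "Feb", "Mar")

# match() finds each abbreviation's position in month.abb (1 to 12),
# and that position indexes the corresponding full name in month.name
full_months <- month.name[match(months, month.abb)]
full_months
# [1] "January"  "February" "March"
```

An abbreviation that is not in month.abb yields NA, so a misspelled file name is easy to detect.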