Reading in multiple Excel files, adding a column, then binding

Question

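I have a series of Excel files that I want to read into R, add a date column to based on the file name, and then bind together.

The files follow the naming convention User_Info_Jan, User_Info_Feb, User_Info_Mar. The month is referenced only in the file name; nothing inside the file itself mentions it. An example of the User_Info_Jan file: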

ID   Name
ABC  Joe Smith
DEF  Henry Cooper 
ZCS  Kelly Ma

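Is there a way to read in the files using a pattern in the file name (pattern = User_Info_), then add a column called "Month" that indicates each file's month before binding them together?

Example data frame after adding the Month column: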

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January

Example data frame after binding:

ID   Name           Month
ABC  Joe Smith      January
DEF  Henry Cooper   January
ZCS  Kelly Ma       January
KFY  Lisa Schwartz  February
LFG  Alex Shah      March

Answer 1

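I would use the map() function from the purrr library to solve this.

Because we are reading in files, there is no reproducible example to work from, so here is a recent example from my own code: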

library(tidyverse)  # loads stringr, tibble, dplyr, purrr, readr and tidyr

# Get all the file names (I assume these contain the month in your case)
GravfilesMap <- list.files("GravityModel/MapOut", full.names = TRUE)

GravMap <-
  # Use a regex to pull the identifying string out of each file name; for your
  # month names I would try "_([a-zA-Z]+)\\.xlsx$" passed to str_match()
  str_match(GravfilesMap, "(\\d+)\\.csv$")[, 2] %>%
  # Convert to a data frame with one row per file
  tibble(month = .) %>%
  # Read each file into its own data frame; each row then holds the extracted
  # name plus a nested data frame.
  # I have used read_csv() here; you will use something like readxl::read_excel()
  # The files are read in the order given by list.files(), so they line up with the names
  mutate(file_contents = map(GravfilesMap, ~ read_csv(.x, col_names = FALSE)))

# Unnest the data frames to get the form that was requested
GravMap <- GravMap %>% unnest(file_contents)

For more detail on a similar approach, see https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/
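If you want to try the read, tag, and bind pattern end to end without any real data, here is a minimal base R sketch. It is not the purrr code above: it swaps in lapply-style base functions and uses small CSV stand-ins written to a temporary directory (the folder name user_info_demo is made up, and the rows are the sample data from the question). For real Excel files you would replace read.csv() with an Excel reader such as readxl::read_excel() and adjust the pattern.

```r
# Write three small stand-in files to a temporary directory (assumed layout)
tmp <- file.path(tempdir(), "user_info_demo")
dir.create(tmp, showWarnings = FALSE)
demo <- list(
  Jan = data.frame(ID = c("ABC", "DEF", "ZCS"),
                   Name = c("Joe Smith", "Henry Cooper", "Kelly Ma")),
  Feb = data.frame(ID = "KFY", Name = "Lisa Schwartz"),
  Mar = data.frame(ID = "LFG", Name = "Alex Shah")
)
for (m in names(demo)) {
  write.csv(demo[[m]], file.path(tmp, paste0("User_Info_", m, ".csv")),
            row.names = FALSE)
}

# Find the files by pattern and pull the month out of each file name
files  <- list.files(tmp, pattern = "^User_Info_.*\\.csv$", full.names = TRUE)
months <- sub("^User_Info_([A-Za-z]+)\\.csv$", "\\1", basename(files))

# Read each file, tag it with its month, then stack the pieces
pieces   <- Map(function(fn, mn) transform(read.csv(fn), Month = mn), files, months)
combined <- do.call(rbind, unname(pieces))

# list.files() sorts alphabetically (Feb, Jan, Mar), so restore calendar order
combined <- combined[order(match(combined$Month, month.abb)), ]
rownames(combined) <- NULL
combined
```

The reordering step at the end is only needed because alphabetical file order is not calendar order; drop it if row order does not matter to you.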

Answer 2

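I'll demonstrate with fake file names, but the commented-out lines, which have the same structure, are the real commands I suggest you run. I'm assuming .xlsx for "Excel files", but this works just as well with .csv (just update the pattern).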

# files <- list.files(path = ".", pattern = "User_Info_.*\\.xlsx$", full.names = TRUE)
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
monthnames <- strcapture("User_Info_(.*)\\.xlsx", files, list(month = ""))
monthnames
#   month
# 1   Jan
# 2   Feb
# 3   Mar

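At this point we have extracted the month name from each file name. I find strcapture (in base R) nicer than gsub, because the latter returns the whole string when there is no match. Another base R option is regmatches(files, gregexpr(...)), but that seems a bit more complicated than is needed here. Yet another alternative is stringr::str_extract, which may be more intuitive if you are already using stringr and/or other tidyverse packages.

From here, we can iterate over the files to read them in.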
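To make that comparison concrete, here is a small base R sketch. The third file name (notes.txt) is a made-up non-matching entry, and for the regmatches variant I use regexec rather than gregexpr because regexec returns the capture group directly; treat it as one illustration, not the only way to do it.

```r
# Two matching names plus one hypothetical file that does not fit the pattern
files   <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./notes.txt")
pattern <- "User_Info_(.*)\\.xlsx$"

# sub() leaves a non-matching string untouched, so the stray name leaks through
via_sub <- sub(paste0(".*", pattern), "\\1", files)
via_sub
# "Jan" "Feb" "./notes.txt"

# strcapture() returns NA for a non-match, which is easier to detect
via_strcapture <- strcapture(pattern, files, data.frame(month = character()))$month
via_strcapture
# "Jan" "Feb" NA

# regmatches() + regexec() gives the same result but needs more handling
matches <- regmatches(files, regexec(pattern, files))
via_regmatches <- vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
)
via_regmatches
# "Jan" "Feb" NA
```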

# out <- Map(function(mn, fn) transform(readxl::read_excel(fn), month = mn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) transform(mtcars[sample(32,size=2),], month = mn), monthnames$month, files)
out
# $Jan
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Chrysler Imperial 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4   Jan
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   Jan
# $Feb
#                   mpg cyl disp  hp drat    wt  qsec vs am gear carb month
# Mazda RX4        21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   Feb
# Pontiac Firebird 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2   Feb
# $Mar
#                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar

Combining that list of frames into a single frame is straightforward:

do.call(rbind, out)
#                        mpg cyl  disp  hp drat    wt  qsec vs am gear carb month
# Jan.Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4   Jan
# Jan.Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2   Jan
# Feb.Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4   Feb
# Feb.Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2   Feb
# Mar.Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4   Mar
# Mar.Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1   Mar
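The question's desired output shows full month names such as January, whereas the file names only carry abbreviations. One possible way to convert, assuming the abbreviations in your file names match R's built-in English month.abb constant, is to look them up in month.name. The abbr vector below is a stand-in for whatever month column you end up with.

```r
# Stand-in for a month column holding three-letter abbreviations
abbr <- c("Jan", "Jan", "Feb", "Mar")

# month.abb and month.name are built-in constants, so match() can translate
full <- month.name[match(abbr, month.abb)]
full
# "January" "January" "February" "March"

# Optional: a factor with calendar-ordered levels sorts by month, not alphabet
month_factor <- factor(full, levels = month.name)
```

An abbreviation that is not in month.abb comes back as NA, which makes unexpected file names easy to spot.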

As an alternative to all of this, you can use data.table::rbindlist or dplyr::bind_rows and assign the "id" column directly:

# out <- Map(function(mn, fn) readxl::read_excel(fn), monthnames$month, files)
set.seed(42)
out <- Map(function(mn, fn) mtcars[sample(32,size=2),], monthnames$month, files)

data.table::rbindlist(out, idcol = "month")
#     month   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <char> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:    Jan  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
# 2:    Jan  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
# 3:    Feb  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
# 4:    Feb  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
# 5:    Mar  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# 6:    Mar  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1

dplyr::bind_rows(out, .id = "month")
#                   month  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
# Chrysler Imperial   Jan 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# Hornet Sportabout   Jan 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
# Mazda RX4           Feb 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
# Pontiac Firebird    Feb 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
# Merc 280            Mar 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
# Hornet 4 Drive      Mar 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1

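The last two work because, when I called Map earlier, the first argument passed to the inner function (monthnames$month) was used as the names of the output list, which is why you see $Jan etc. as the elements of the returned list. When idcol= / .id= is used, both rbindlist and bind_rows use those names for the "id" column. (If no names are actually present, both functions number the elements instead.)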
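Here is a small sketch of that naming behaviour, in case it helps to see it in isolation. The one-column data frames are placeholders, not real file contents, and the dplyr part only runs if dplyr happens to be installed.

```r
# Stand-ins for the month labels and file names
months <- c("Jan", "Feb", "Mar")
files  <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")

# Map() names its result after the first vector when that is a character vector
out <- Map(function(mn, fn) data.frame(file = fn), months, files)
names(out)
# "Jan" "Feb" "Mar"

# Swapping the argument order means the names become the file paths instead
out_by_file <- Map(function(fn, mn) data.frame(file = fn), files, months)
names(out_by_file)

# With the names removed there is nothing for an id column to pick up
no_names <- unname(out)
is.null(names(no_names))
# TRUE

# Optional comparison of the resulting id column with and without names
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(dplyr::bind_rows(out, .id = "month"))
  print(dplyr::bind_rows(no_names, .id = "month"))
}
```

So if the month labels are not the first vector you hand to Map, set the names yourself (for example with setNames) before binding.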

Answer 3

You can try the purrr package like this:

files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")
months <- c("Jan","Feb","Mar")

library(openxlsx)
library(purrr)
library(dplyr)  # provides %>% and mutate()
map2_dfr(files, months, function(x, y) read.xlsx(x) %>% mutate(Month = y))
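If you would rather not type the months by hand, one option is to derive them from the file names. This is a base R sketch that assumes every file follows the User_Info_<Month> convention from the question; tools::file_path_sans_ext ships with R.

```r
files <- c("./User_Info_Jan.xlsx", "./User_Info_Feb.xlsx", "./User_Info_Mar.xlsx")

# Drop the directory and the extension, then strip the fixed prefix
stems  <- tools::file_path_sans_ext(basename(files))
months <- sub("^User_Info_", "", stems)
months
# "Jan" "Feb" "Mar"
```

The resulting months vector lines up with files element by element, so it can be passed as the second argument of a two-argument mapper.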
