How to handle 2500 .csv files in R

Question

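How can I process 2500 .csv files that all have the same number of columns (4)? I want to import the files, drop the first and second columns, rename the remaining columns to x and y, remove duplicate rows within each data frame, and finally save each data frame as its own .csv file (2500 files). I used the following script: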

library(dplyr)
# Get all filenames
list.files(path = "D:/R_project", full.names = TRUE) %>%
  # Import all files
  purrr::map(readr::read_csv) %>%
  purrr::map(
    ~ .x %>%
      # Select columns and rename
      select(
        x = Col3,
        y = Col4
      ) %>% 
      # Remove duplicates
      distinct()
  ) %>% 
  # Save all files (same filename, but in a different folder)
  purrr::walk2(
    list.files("D:/R_project/Treated"),
    ~ readr::write_csv(x = .x, file = paste0("output/folder/", .y))
  )

However, I end up with an error for every data frame (below is the output for one of them):

Rows: 1579 Columns: 4
Column specification ---------------------------------------
Delimiter: ","

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:                                                                                                          
* `` -> ...1

How can I solve this? Any help would be appreciated.

Answer 1

There are many ways to do this in R. Here is an example that uses dplyr for data manipulation, readr to import and export the CSVs, and purrr to process all files in one go.

library(dplyr)

# Get all filenames
list.files("path/to/your/csv/files", full.names = TRUE) %>%
  # Import all files
  purrr::map(readr::read_csv) %>%
  purrr::map(
    ~ .x %>% 
      # Select columns and rename
      select(
        x = <your x column>,
        y = <your y column>
      ) %>% 
      # Remove duplicates
      distinct()
  ) %>% 
  # Save all files (same filename, but in a different folder)
  purrr::walk2(
    list.files("path/to/your/csv/files"),
    ~ readr::write_csv(x = .x, file = paste0("output/folder/", .y))
  )

Since you did not give us any code, you may need to adapt this example, but I hope it is enough to get you started.
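As a side note, the output quoted in the question looks like readr's informational messages rather than an error: the column specification summary can be silenced with `show_col_types = FALSE`, and the "New names" line suggests the first column has an empty header that was auto-named `...1`. The sketch below is one way to put that together. It assumes the columns to keep are the third and fourth by position, and the folder names and the helper `process_one` are hypothetical, made up for illustration. It also takes the output file names from the input files themselves via `basename()`, so the two lists always have the same length.

```r
library(dplyr)

# Hypothetical folders for illustration; replace with your own paths.
in_dir  <- file.path(tempdir(), "csv_in")
out_dir <- file.path(tempdir(), "csv_out")
dir.create(in_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)

# Two small made-up files standing in for the 2500 real ones.
# The first header is empty, as the "New names" message in the question suggests.
writeLines(c(",id,Col3,Col4", "1,10,a,b", "2,10,a,b", "3,11,b,c"),
           file.path(in_dir, "file1.csv"))
writeLines(c(",id,Col3,Col4", "1,20,e,f", "2,20,e,f", "3,21,f,g"),
           file.path(in_dir, "file2.csv"))

in_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: read one file, keep columns 3 and 4 as x and y, dedupe.
process_one <- function(path) {
  readr::read_csv(path, show_col_types = FALSE) %>%
    select(x = 3, y = 4) %>%  # select by position, so the header names do not matter
    distinct()
}

in_files %>%
  purrr::map(process_one) %>%
  # Write each result under its original file name, in the output folder.
  purrr::walk2(
    basename(in_files),
    ~ readr::write_csv(x = .x, file = file.path(out_dir, .y))
  )
```

The "New names" message may still be printed while reading, which should be harmless here because the auto-named first column is dropped anyway.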

Answer 2

If you don't need to use R, there is a great command-line utility for working with CSV files that can do everything you need: GoCSV.

I have two small sample CSVs, file1.csv and file2.csv:

c1,c2,c3,c4
1,2,a,b
1,2,a,b
2,3,b,c

c1,c2,c3,c4
5,6,e,f
5,6,e,f
6,7,f,g

This little script:

removes the first two columns
dedupes based on the remaining 2 columns
renames those two columns

ls file1.csv file2.csv | while read -r CSV; do
    gocsv select -c c3,c4 "$CSV" |                     # cut out the first 2 columns
    gocsv uniq -c c3,c4 |                              # dedupe based on the remaining 2 columns combined
    gocsv rename -c c3,c4 -names c1,c2 > "$CSV.fixed"  # rename columns and save to tmp file
    mv "$CSV.fixed" "$CSV"                             # mv tmp back to old filename
done

The "fixed" CSVs now look like this:

c1,c2
a,b
b,c

and

c1,c2
e,f
f,g
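
For comparison, the same three steps can be sketched in base R without any extra packages. This is only an illustration on the file1.csv sample data from above, read from an inline string rather than from disk; the variable names `csv_text`, `df` and `fixed` are made up for the example.

```r
# Sample data copied from file1.csv above.
csv_text <- "c1,c2,c3,c4
1,2,a,b
1,2,a,b
2,3,b,c"

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Keep only the last two columns, then drop duplicate rows.
fixed <- unique(df[, c("c3", "c4")])

# Rename the two remaining columns, as the GoCSV script does.
names(fixed) <- c("c1", "c2")

# Print the result as CSV to the console (no row names, no quoting).
write.csv(fixed, row.names = FALSE, quote = FALSE)
```

To write a file instead of printing, pass a path as the `file` argument of `write.csv()`.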
