R throws an error when importing a valid .zip file

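I manually downloaded a zip file via this link, and I was able to save it, unzip it, and open its contents (a .csv file) without any problem.

However, I run into a problem when I try to import it in R: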

test_file <- paste0(dest_path,"/test.zip") ### ok
download.file("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001"
, destfile = test_file ) ### ok
    
unzip(test_file, exdir = paste0(dest_path,"/unzipped")) #### ERROR!!!

The reported error is:

Error in unzip(test_file, exdir = paste0(dest_path, "/unzipped")) : 
  zip error: `Cannot open zip file `x:\test.zip` for reading` in file `zip.c:140`

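I also tried data.table::fread(), and it seems to fail with the same kind of error.

OBS: comparing the character streams of the two files, I found that they are exactly the same, except that the file fetched through download.file has one extra empty line.

What could be happening?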

dt <- fread("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001"
             , verbose = T) 

Output:

OpenMP version (_OPENMP)       201511
  omp_get_num_procs()            4
  R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
  R_DATATABLE_NUM_THREADS        unset
  R_DATATABLE_THROTTLE           unset (default 1024)
  omp_get_thread_limit()         2147483647
  omp_get_max_threads()          4
  OMP_THREAD_LIMIT               unset
  OMP_NUM_THREADS                unset
  RestoreAfterFork               true
  data.table is using 2 threads with throttle==1024. See ?setDTthreads.
 Downloaded 5333397 bytes...Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 2 threads (omp_get_max_threads()=4, nth=2)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file C:\Users\fabio\AppData\Local\Temp\RtmpMZnoDr\file5b463781124.
  File opened, size = 5.062MB (5307463 bytes).
  Memory mapped ok
[03] Detect and skip BOM
  Last byte(s) of input found to be 0x00 (NUL) and removed.
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
  File ends abruptly with 'P'. Final end-of-line is missing. Using cow page to write 0 to the last byte.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<PK>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 2 lines of 2 fields using quote rule 0
  sep='|'  with 2 lines of 3 fields using quote rule 0
  Detected 3 columns on line 7. This line is either column names or first data row. Line starts as: <<µTŸ   I>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 3
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (5305832 bytes from row 1 to eof) / (2 * 722 jump0size) == 3674
  A line with too-few fields (2/3) was found on line 2 of sample jump 0. 
  Type codes (jump 000)    : CCC  Quote rule 0
Types in 1st data row match types in 2nd data row but previous row has 2 fields. Taking previous row as column names.  All rows were sampled since file is small so we know nrow=1 exactly
[08] Assign column names
[09] Apply user overrides on column types
Error in fread("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001",  : 
  embedded nul in string: '4+¹a¶2ÛWro8v\03z§7<ÄRþð<\tMËåtÄãVëª\bÓ*¡†ÿ]Ü\05£â} ¨¤WÚSøñþcLLÛbĺŽçP6üügLô1ÄŒg3ºÜO\fÎ?]Î-}3þ¿µÂ ~Ž´X"x4Çï±\¼¹\rø5æy?‹6wQ¦&0'7¿aX©\04,߇Õ"ðá4¸ï50Cêm\06nŸ'¹(ôà5vï]èóžÝ±\0[”íÎc7à\01ª”ë¸&c[>øÂÎÓâU+ÁåL6]'
In addition: Warning messages:
1: In fread("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001",  :
  Previous fread() session was not cleaned up properly. Cleaned up ok at the beginning of this fread() call.
2: In fread("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001",  :
  Detected 2 column names but the data has 3 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Up front: download the file in binary mode.

download.file("https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001"
, mode = "wb"
, destfile = test_file )

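download.file (on Windows) tries to decide between a text and a binary transfer based on the file name. If the extension is not one it "knows", it defaults to a text download, which corrupts binary files. Note that this matters mainly on Windows; from ?download.file: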

The choice of binary transfer ('mode = "wb"' or '"ab"') is important on Windows, since unlike Unix-alikes it does distinguish between text and binary files and for text transfers changes '\n' line endings to '\r\n' (aka 'CRLF').

On Windows, if 'mode' is not supplied ('missing()') and 'url' ends in one of '.gz', '.bz2', '.xz', '.tgz', '.zip', '.jar', '.rda', '.rds' or '.RData', 'mode = "wb"' is set so that a binary transfer is done to help unwary users.

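In your case, the URL has no such extension, so the download defaults to mode = "w". That text-mode transfer rewrites line-ending bytes inside the archive, which is why both unzip() and fread() fail on the result.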
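
To make the fix harder to forget, the binary download and the unzip step can be wrapped in one small function. This is only a sketch: `fetch_zip` is a hypothetical helper name (not part of base R or data.table), and `dest_path` in the usage comment is the variable from the question.

```r
# Sketch of a helper that always downloads in binary mode before unzipping.
# `fetch_zip` is a hypothetical name, not an existing R function.
fetch_zip <- function(url, destfile, exdir = NULL) {
  # mode = "wb" forces a binary transfer, so the bytes are written unchanged
  # even when the URL does not end in a known extension such as ".zip".
  status <- download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  if (status != 0) {
    stop("download failed with status ", status)
  }
  # Extract only when a target directory was requested.
  if (!is.null(exdir)) {
    unzip(destfile, exdir = exdir)
  }
  invisible(destfile)
}

# Usage with the URL from the question (not run here, needs network access):
# fetch_zip(
#   "https://www.portaltransparencia.gov.br/download-de-dados/despesas-execucao/202001",
#   destfile = file.path(dest_path, "test.zip"),
#   exdir    = file.path(dest_path, "unzipped")
# )
```

After this, the extracted .csv can be read with fread() from the unzipped directory. Passing the URL straight to fread() will still fail, because fread() expects delimited text rather than a zip archive.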