Why does haven::write_dta() inflate file size and can it be changed?

Sometimes I need to convert SPSS files to DTA files. I usually use Stat/Transfer, but I thought I might be able to use R instead and save some money.

However, when I convert files with the haven package, the resulting files are significantly larger than the ones Stat/Transfer produces.

For example, here is a .sav file I found on the internet. It is 85 KB.

Converting it with Stat/Transfer produces a smaller, 47 KB .dta file.

However, when I run the following code, I get a 118 KB .dta file. That is 2.5 times the size of the Stat/Transfer output.

from.sav <- haven::read_sav("PsychBike.sav")
haven::write_dta(from.sav, "PsychBikeFromHaven.dta")

Is there any way to make the output of haven::write_dta() smaller?

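This happens because write_dta() does not compress the data the way Stata's compress command does. That is, write_dta() often picks storage types that are larger than the data needs. Below is an extreme but real example from my work. (File names and variable names have been edited.)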

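Note the file size. It drops from about 1 MB to about 6 KB, a 99.4% reduction. The real dataset has millions of observations, so converting it to dta with write_dta() is difficult for me. This may need to be fixed at the ReadStat level.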
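Until that happens, one possible workaround on the R side is to narrow numeric columns before writing. The sketch below is only an illustration: shrink_doubles() is a hypothetical helper, not part of haven, and it assumes that haven writes R integer vectors with a narrower Stata type than double, which is worth confirming with `describe` in Stata. It does not address over-wide string columns.

```r
# Hypothetical helper (not part of haven): store whole-number doubles as integers.
shrink_doubles <- function(df) {
  # TRUE when every non-missing value is a whole number that fits in an R integer.
  is_whole <- function(x) {
    all(is.na(x) | (x == round(x) & abs(x) <= .Machine$integer.max))
  }
  for (nm in names(df)) {
    x <- df[[nm]]
    # Skip classed columns (Date, haven_labelled, ...) so their semantics stay intact.
    if (is.double(x) && !is.object(x) && is_whole(x)) {
      storage.mode(x) <- "integer"  # keeps attributes such as variable labels
      df[[nm]] <- x
    }
  }
  df
}

# Small made-up data frame to show the effect.
demo <- data.frame(id = c(1, 2, 3), score = c(1.5, 2, NA), flag = c(0, 1, NA))
small <- shrink_doubles(demo)
sapply(small, typeof)

# Assumed usage with the objects from the question (requires the haven package):
# haven::write_dta(shrink_doubles(from.sav), "PsychBikeFromHaven.dta")
```

In the demo, id and flag become integer while score stays double because it holds a fractional value. Running compress in Stata afterwards is still the more thorough option, since it can also pick byte and int and shorten strings.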

. desc, size

Contains data from v1.dta
  obs:           100
 vars:            22                          04 Sep 2019 10:19
 size:     1,032,900
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
var1            double  %10.0g
var2            str1    %-9s
var3            double  %td
var4            double  %td
var5            str4    %-9s
var6            str1    %-9s
var7            str2045 %-9s
var8            str2045 %-9s
var9            str2045 %-9s
var10           str2045 %-9s
var11           str2045 %-9s
var12           str5    %-9s
var13           double  %10.0g
var14           double  %td
var15           double  %10.0g
var16           str3    %-9s
var17           double  %10.0g
var18           double  %10.0g
var19           double  %10.0g
var20           double  %10.0g
var21           double  %10.0g
var22           str2    %-9s
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
r; t=0.00 10:27:24

. compress
  variable var1 was double now long
  variable var3 was double now int
  variable var4 was double now int
  variable var14 was double now int
  variable var17 was double now byte
  variable var18 was double now long
  variable var19 was double now byte
  variable var20 was double now byte
  variable var7 was str2045 now str1
  variable var8 was str2045 now str1
  variable var9 was str2045 now str1
  variable var10 was str2045 now str1
  variable var11 was str2045 now str1
  (1,026,700 bytes saved)
r; t=0.00 10:27:34

. desc, size

Contains data from v2.dta
  obs:           100
 vars:            22                          04 Sep 2019 10:19
 size:         6,200
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
var1            long    %10.0g
var2            str1    %-9s
var3            int     %td
var4            int     %td
var5            str4    %-9s
var6            str1    %-9s
var7            str1    %-9s
var8            str1    %-9s
var9            str1    %-9s
var10           str1    %-9s
var11           str1    %-9s
var12           str5    %-9s
var13           double  %10.0g
var14           int     %td
var15           double  %10.0g
var16           str3    %-9s
var17           byte    %10.0g
var18           long    %10.0g
var19           byte    %10.0g
var20           byte    %10.0g
var21           double  %10.0g
var22           str2    %-9s
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.
r; t=0.00 10:27:37
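
The sizes reported by desc can be checked by hand: each observation takes the sum of the variable widths (byte 1, int 2, long 4, double 8, strN takes N bytes). The short R sketch below redoes that arithmetic for the two listings above; the width() helper is made up for this illustration.

```r
# Bytes per value for each Stata storage type; strN takes N bytes.
width <- function(type) {
  fixed <- c(byte = 1, int = 2, long = 4, float = 4, double = 8)
  is_str <- grepl("^str[0-9]+$", type)
  out <- unname(fixed[type])  # NA for strN types, filled in on the next line
  out[is_str] <- as.numeric(sub("^str", "", type[is_str]))
  out
}

# Storage types of var1..var22 before and after compress, copied from the logs.
before <- c("double", "str1", "double", "double", "str4", "str1",
            rep("str2045", 5), "str5", "double", "double", "double",
            "str3", rep("double", 5), "str2")
after <- c("long", "str1", "int", "int", "str4", "str1",
           rep("str1", 5), "str5", "double", "int", "double",
           "str3", "byte", "long", "byte", "byte", "double", "str2")

n_obs <- 100
size_before <- n_obs * sum(width(before))  # 1032900
size_after  <- n_obs * sum(width(after))   # 6200
saved <- size_before - size_after          # 1026700
round(100 * saved / size_before, 1)        # 99.4
```

Almost all of the saving comes from the five str2045 variables: they account for 10,225 of the 10,329 bytes in each observation before compress, and 5 of 62 bytes after.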