Why does haven::write_dta() inflate file size, and can it be changed?
Sometimes I need to convert SPSS files to DTA files. Usually I use Stat/Transfer, but I thought maybe I could save money by using R instead.
However, when I transfer a file with the haven package, the resulting file is significantly larger than what Stat/Transfer produces.
For example, here is a .sav file I found on the internet. It is 85 kB.
Stat/Transfer converts it to a smaller 47 kB .dta file.
But when I run the code below, I get a 118 kB .dta file. That is 2.5 times the size of Stat/Transfer's output.
from.sav <- haven::read_sav("PsychBike.sav")
haven::write_dta(from.sav, "PsychBikeFromHaven.dta")
Is there any way to make the output of haven::write_dta() smaller?
This happens because write_dta() does not compress. That is, write_dta() often chooses storage types that are larger than necessary. Below is an extreme but real example from my own work. (File and variable names have been redacted.)
Note the file size: it drops from about 1 MB to 6 kB after running Stata's compress, a 99.4% reduction. The real dataset has millions of observations, so write_dta() is hardly usable for converting it to .dta. A proper fix would probably have to happen at the ReadStat level.
. desc, size
Contains data from v1.dta
obs: 100
vars: 22 04 Sep 2019 10:19
size: 1,032,900
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
var1 double %10.0g
var2 str1 %-9s
var3 double %td
var4 double %td
var5 str4 %-9s
var6 str1 %-9s
var7 str2045 %-9s
var8 str2045 %-9s
var9 str2045 %-9s
var10 str2045 %-9s
var11 str2045 %-9s
var12 str5 %-9s
var13 double %10.0g
var14 double %td
var15 double %10.0g
var16 str3 %-9s
var17 double %10.0g
var18 double %10.0g
var19 double %10.0g
var20 double %10.0g
var21 double %10.0g
var22 str2 %-9s
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
r; t=0.00 10:27:24
. compress
variable var1 was double now long
variable var3 was double now int
variable var4 was double now int
variable var14 was double now int
variable var17 was double now byte
variable var18 was double now long
variable var19 was double now byte
variable var20 was double now byte
variable var7 was str2045 now str1
variable var8 was str2045 now str1
variable var9 was str2045 now str1
variable var10 was str2045 now str1
variable var11 was str2045 now str1
(1,026,700 bytes saved)
r; t=0.00 10:27:34
. desc, size
Contains data from v2.dta
obs: 100
vars: 22 04 Sep 2019 10:19
size: 6,200
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
var1 long %10.0g
var2 str1 %-9s
var3 int %td
var4 int %td
var5 str4 %-9s
var6 str1 %-9s
var7 str1 %-9s
var8 str1 %-9s
var9 str1 %-9s
var10 str1 %-9s
var11 str1 %-9s
var12 str5 %-9s
var13 double %10.0g
var14 int %td
var15 double %10.0g
var16 str3 %-9s
var17 byte %10.0g
var18 long %10.0g
var19 byte %10.0g
var20 byte %10.0g
var21 double %10.0g
var22 str2 %-9s
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
r; t=0.00 10:27:37
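If round-tripping through Stata's compress is not an option, one partial workaround on the R side is to downcast double columns that hold only whole numbers to integer before calling write_dta(), since haven stores R integer vectors in smaller Stata types. This is only a sketch, not part of haven's API: the helper name shrink_for_dta is made up here, and it deliberately skips haven-labelled columns so value labels are not lost. It also does nothing for the oversized str2045 string columns, which would still need a ReadStat-level fix.

```r
# Made-up helper (base R only): downcast plain double columns that hold
# only whole numbers to integer, so haven::write_dta() can store them in
# a smaller Stata type instead of an 8-byte double.
shrink_for_dta <- function(df) {
  df[] <- lapply(df, function(x) {
    if (is.double(x) &&
        !inherits(x, "haven_labelled") &&           # keep value labels intact
        all(x == trunc(x), na.rm = TRUE) &&         # whole numbers only
        all(abs(x) < .Machine$integer.max, na.rm = TRUE)) {
      as.integer(x)
    } else {
      x
    }
  })
  df
}

# Usage (assuming haven is installed):
# from.sav <- haven::read_sav("PsychBike.sav")
# haven::write_dta(shrink_for_dta(from.sav), "PsychBikeSmaller.dta")
```

Note that this only shrinks numeric columns whose values happen to be integral; fractional doubles and dates formatted as doubles pass through unchanged.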