避风港:read_dta错误()"Failed to parse /Users/folder/my_data.dta: Unable to allocate memory."

Haven: read_dta error() "Failed to parse /Users/folder/my_data.dta: Unable to allocate memory."

我有一个相对较大的 .dta 文件,包含 1280000 个观察值,在 Stata 中运行良好,但我无法将其导入 R。 数据是用Stata 15创建的,数据包含strL或str#,#>244个变量,不能保存为Stata 12格式。

我正在尝试使用 haven 包通过 read_dta() 导入保存的数据,但它给我以下错误消息:"Failed to parse /Users/folder/my_data.dta: Unable to allocate memory." 有谁知道可能导致此问题的原因以及如何克服它以便能够在 R 中导入数据?

我尝试通过多种方式解决这个问题,但 none 我的尝试似乎奏效了。

  1. 首先我尝试使用 Sys.setenv('R_MAX_VSIZE'=32000000000) 扩展我的 r 环境的内存大小,但是当我尝试导入数据时控制台报告相同的错误。该问题似乎与我在 R 中的内存大小无关。

  2. 我尝试在 Stata 中使用 saveold my_data13, version(13) 以 Stata 13 格式保存数据,但尝试使用 haven 将其导入 R 仍然会产生相同的错误消息。

  3. 我尝试使用 readstata13 函数 read.dta13(my_data13),但这经常导致 R 崩溃。

奇怪的是,只需双击它,我就能在 Stata 中正确打开数据。

有人对如何解决这个问题有什么建议吗?对 a) 错误消息的含义以及如何解决它的任何见解 2) 能够操作 stata15 文件的替代包 c) 能够在 R 中打开数据的方法将是最受欢迎的。

非常感谢您的帮助

此致

我只想总结评论中提到的所有内容。

official Stata documentation,我们有以下内容:

strL variables can be 0 to 2-billion bytes long. strL variables are not required to be longer than 2,045 bytes. str# variables can store strings of up to 2,045 bytes, so strL and str# overlap. This overlap is comparable to the overlap of the numeric types int and float. Any number that can be stored as an int can be stored as a float. Similarly, any string that can be stored as a str#, can be stored as a strL. The reverse is not true. In addition, strL variables can hold binary strings, whereas str# variables can only hold text strings. Thus the analogy between str#/strL and int/float is exact. There will be occasions when you will want to use strL variables in preference to str# variables, just as there are occasions when you will want to use float variables in preference to int variables.

以上信息让我在尝试复制您的问题之前推断出两点。

  1. 在您的情况下,没有必要使用 strL

  2. strL 的使用是问题的根源,可能导致 haven 库出现兼容性问题。

但是,在尝试复制您描述的内容后,我得出了不同的结论。


请仔细考虑以下模拟您的问题的代码。

version 16
clear all 
set obs 1
*4200 character string generated here: http://www.unit-conversion.info/texttools/random-string-generator/
gen strL str_var_in_strL = "eSw0qZcVs5DHU2GxgRo1Seo9uTwJ0MvHXyYUQidJMRWw8KW1310Ec242O6D4xrLziO4c56WgluSddTy0Q64QapkwGgOMZdy8ru0fyss0nwJvF4M3kBjYGF00ZsvQGYt4DjF51R3vxTzUx4xlApKwaoRADIgFlXvBh2Bug0VVhmXR3uInHDfpmID57kVWiyxX1gELdyPMVzJWizEHVx2GpjBsm1UdRphDdukFtFrnkr1HFRXBekxHkW3uOCHz0wnyDBfwitDGHosctRrWPhIjujnoalOaHkI5jbnENSNJEsOdGohoe5QKZIxtXmVbD4l8m8wLCbuSjZLw8NzU5vjPX57T2yWWasdFMIHk3kFipT0CG3dNForECS8UiW6ZWSIEmO2V62uakfrxTsRb9fIFVBUIHpGizeR0b27OnfSVB2wE2Ix0ij7kR19jz0wIh35fbwkJWqLq93pfHEtGu0FTb8H5A4XNOcR8chEAQBI7zV3rosSGnSP2h9QZtuSAcz1TrRHNMpCguvNf1DD72TCfCaiBXyflOCre7f5zchLA7k2cQ5qi4fBMVc9GnAdGB2vnjFeFlwaUD0AEUhfSJJINRQ2CKfJegqUL0jBgHBVy5cYCNxsP8Gu8NXRUo6vvyiTJMDcBkL0JKNOT4usSDi4v86cJNzQQa3ArafRzOv1RFz8BfI7pP7rXDLD6d1Z1miCqTZ8UtJBVQ0Z0eCQmTrlAvlu5busOjcAl4ZV7THH6qCV8tI53zh1THBfjnEgoPxy8UIaIK6tXDUM4RFMMd1366324mJEVwyvc5CWgzPian39Q3GFLl6zXCfD4pw7rSUmH5CNOmKgPihxPbV9NSBxiwVK3M07KFS2hZbf3ZDB4CBSJV9geFWKZlR3XNrsPudQgkpsdywNNjZTDwD2RiHF7kQAgyEW7q1w42OC2IbreBBtiPekx6yzCEWBEokLwfhrhbOnDwcnFmfKjnrxCbqypXrSnyvrUP2nUQ9vBmdxCqiVLrBHuDi6Wv2U4vyZ7dTqk84WmnwACXo5PbYY2dmhtjscLMpRw4Q6xVUEWC3qPMnQkbI1UKEq1NfOrF0X8nC0rqrwHQuNuJqHuebJj5AMXVgyZWTaqYIb4gkbGaEze3wNmHbbj1q2bmumiwd6RZRSdx7U3ZwozO9kTkZ69NHFSa2QDi8GrhgvBDMshJVaOR9K8tWcpa2QrFD7cI0ZqzneLXHXm6LsOmtZPFmikKfyts1pASGwZ8DzuWfT3j0daNmyk6y0HwHwM98KOyeuSXnQJOJzunAXkidv90hrgviWUhP70Nrx527JI7vpRr2dClBBnzO2O7YwjTdKTrmfcPs9z4iLeroo20Jg9ODHjvUYWtLRTOKrgvYAgywkj2PoVdwmuYK32UKcqH5EdhPHxWarjsqUuBb40u6nUGIQ0YS7ZuzsnVDerB8hO3rCl0FlMMYgRh4vdPcEG23JQoIwTvdujULg6Lpplyt6yK16UhVSklj6aVNIoA4zr51dULyOzWF9ZqlZz7l90QpXvLuRD9Elr5gxWvNW4fvkCAU3kpEv0s7gHS7ytjNxm0WLk74bN1iP8ZjcxXXBqwtatCo9e1Ayc59VYR9RVxtfvilb038WpHglhWEZoK91rumPSFiCJWUmlkL6P4SAbz5b6LDdW9ybiN8zdZmNtQ2px556d7DF5RRcXgLocLH37Uh6uU9cz2wmWRrcJS4rO9MkUe6KSuVjVLXSsk6J1bnvvagWl4BkY8ZPm0iBg4XXTkRAjfVgnfex1hee47b6k9c5gdS6AJSVazCPpXQJlGJ7NpyAn3hXdHkhaGtokTmne6Zag8DterOyDldPXHXwrG7PgtsmREc0VugLVPrYEbdf9QMHBtGQLwQz04Gyg2lspZ5HbGnOkfI0MTanuMN7XnWdcGBko9gmQKbpONgPqg8POcpxG2aRefswG090hvYKj5gzp3r1nitZZhBm8KDUT8P2Wy06hPxrkZinMGmBv2SIDegXr5uzceHymEnyMQZINS96QCyTiV7z1X1NZ3IBDfVPZTZ9bRxpKyMbAnYzFhx9PYSkescyMMtsGOEli1gFp2PWcqO4bpj0EnKjgWf9ae2R5nDKIkVbsNRCik3JrCM7WjHPfwdZSiA335Eyl0yoHQWjp6YJrR8ykOtw3zL2XHa2ilKIRSypG5dtDwjuqLI1fb7fB5wiG3LuowKqam8HY86aDsuu0DkpED9mAxoSvE7V6WPs5ptg31yoUOgGK1rvGtdpY7CHkaBmmv0jKNYjcZiuET6Q0If2IO36HafXJN8onjMvYadAypEY4IAxkmU2yemCFQkdDBuhr8G4DWcGO46W24QOvMghH6k3HVHeDgj5dNXKIz3rqbCC6tSNMptuQhoM2eWnwBFw6PzNIqKwB6qxbJMs9wxqvDEJlQkKgMZN4HJu1hNFpXIePLN8dSsV8xhgb1pxwFaqhLQMoYXcgobOdcrb7DDpJFbWhUXKn3WHEjO8nk4EAuNmIUdyfWxwxPPmTLEyohT7QrcjvRph82n64aRJIcyDPCho8pqtCTve84PAp4jIeechI8sl92e94jsX7XTZu3LaqDGkEEtcmp69ZqPA5Ev9NZv5ovmWNiu39kKu7QW0YnXGCvorirdScdCow4NyLgnpoAEG0FPz52oN7xU5xyTgY0x5Hel5GDsA28Qoy463tyBNuT51gQROr9XqgiM9Voq0ax1vI8QKFMLXwaLtHwPC7TkFtJbXYNmPt2kXzQ1EjLq0DGiyKd0BFMww3zpEjeSUS7KhCrf5qU7aDjIkVniPs6TGkTBOwG9ItrUv50WJfqgOd6ngHfYWzJFIAgZnGjXhtkHartO3F19iPs5VHRhTEUw2HZbgnTjmf2NmJ1onUkMNSFUMPrNkrfxl78GHNjJrGAkRU0jlXqAuIK8v6uYh4oyqSrFtzPru8GNIWCbka6LKLrMcCysoDk7VQI12xzELxUebiUsnLYCrnvmJtD0T7yv9M8H8rI4YsdbzD3forc56uvwqS0h0Bl6Sw71n781EC0R0V6067RA2TRZ39fs7yXYZ4O5pQ1uHn0qV82aZI1kHxWVJ1omu81KqoFpTnB0QuNd62AeVKRiuMiAf2UFhy40vFgFElRZFipH1TrJAuFYcgwd38kJWTGTYyW51z1DQfVZnlIegEfZQPTjnhFroayXw55MKnJGiVPQ4R0A2nO5LCDmDFmC8SyLz62pn0aQb5tvlQs5Es7woSf3SbxcKh9JndW42V8hQn4uXxbEKhAX9f48VJg5xXYsMOaBz4h0UfsOrOucFH3YVA9c7TVszXbSq7kRQkpsy3xkunDaIfdiAx5wdLE7LbPhUwrC1FWnCa2qQQxNUimxrQ35Woar9tNQSwpVd8ybEaivgQ77HPSYjTdkKX2j2TBCzmGVBesOUnWI9r34kRO5xPsPPDoJvLQe6kns75Yjcuz82OEuUai1PLVRKmzkRjyLp3tt5YDkjzuYCOWNchY3Eup1IEDvGf64wu4S1qLvRrl6HI9jZj7Li2GZc9grCTxbqpUQgCbCxdgmS6a396AJNijmG8uNnchGPlnNVm6DskG7T2pWasVuuhYhkyFNoUWuY5mBXurDMEDyyZPxlY9nlQYKHBNgg6ZnNEYnwCqTLzudDBQ48YG9r3700uvz83jJAX18s2Kjm2LlmOuPJON6rzbPua8Ac2Y0HPuQZD9Ikcim2MOyR9mbvtRTPeLAX3issevCDYaBiG6BFMaN9rW7j1UnlKQZYgTCveE4oH8tT7QGwdWENAjW4kGjS93zCS6QYxyUjg03er43KivMQOVHaT3iznZnQD3Nk5c0T9IKqRpcytY7JaRV7kmayUmKc4d1ApFqY8imlu2iTMiVfY16qMqDeulTtcKjKUuyWBrJSENwv238nWXQudShLeCsiwUMUnJvXyHdSsmAaaoG5O3RA4GkiQVAiX63tWPl6GNfweFAcpxoD4x8hZpbQ1SaBo3pRNwwHuAvzOwm0jKWndfugKqlUmPoDZb9Bx6dzDolUJtHSNYVrOACFY26SyeyeiFHnnd6wZFypkDGL4LgeBbU8TqJTjO2lFXVmCQZjTnO75V43vJKoHUrRSUkZJYTyl3c8tqWJmrUZ7lJo4cUrGzoudgzDHH8N8H73TwXboF8rbQ8i4yb9w51L0EnCd3kdmSXI0PJxuQ9CQ8AD9WKTwvEaSWDTZ"

describe

Contains data
  obs:             1                          
 vars:             1                          
------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
str_var_in_strL strL    %9s                   
------------------------------------------------------------------------------------

save "route\all_in_strL" , replace

之后,我将您的修改应用于我生成的数据。

gen var1 = str_var_in_strL

foreach var of varlist var1 {
     generate str `var'_str = `var' 
     replace `var' = "" 
     compress `var' 
     replace `var' = `var'_str  
     drop `var'_str  
     describe `var' 
}

  obs:             1                          
 vars:             2                          16 Dec 2020 12:07
------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------
str_var_in_strL strL    %9s                   
var1            strL    %9s                   
------------------------------------------------------------------------------------

save "route\edited_strL" , replace 

奇怪的是,我能够使用 haven 库和以下代码将这两个文件导入 R。

library(haven)
file <- "route_to_first_file/all_in_strL.dta"
# Import first file into R.

dta_in_R <- read_dta(
                    file,
                    encoding = NULL,
                    col_select = NULL,
                    skip = 0,
                    n_max = Inf,
                    .name_repair = "unique")



# Import edited file using your loop method into R.
file <- "route_to_edited_file/edited_strL.dta"

edited_dta_in_R<- read_dta(
                          file,
                          encoding = NULL,
                          col_select = NULL,
                          skip = 0,
                          n_max = Inf,
                          .name_repair = "unique")


这里唯一的区别可能是:

  1. 数据集的大小(在我的例子中只有一个观察值)。
  2. 您可能使用的是早期版本的 Stata(我使用的是 Stata 16)。

最后,我认为问题的根源不是数据的 strL 类型,而是您计算机上的可用内存,这可能已通过您在 for 中的 compress 步骤解决您描述的循环。


PS:在 Win10 上一切都是 运行。 R 版本 4.0.3 (2020-10-10) 和 haven_2.3.1