Why, not How: Stata incorrectly identifies var type on large dataset with mixed (string+numeric) values within vars

Question

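I am reading a pipe-delimited text file with 5 million observations. In one column, the first 250,000 values are numeric; the rest are strings. The code below imports the first 250,000 values as numbers, declares the variable numeric (long), and treats the string values as missing.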

import delimited "mixed_types.txt", delimiter("|")

Workaround: import all variables as strings, then destring:

import delimited "mixed_types.txt", delimiter("|") stringcols(_all)
destring, replace
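
A minimal sketch of how one might check what the workaround did. It assumes the mixed column ends up in a variable named v3, which is a hypothetical name (the question does not give the real one). destring leaves a variable as a string when it contains nonnumeric characters, so the mixed column should remain a string while the purely numeric columns are converted.

```stata
* Import every column as a string, then convert only the fully numeric ones
import delimited "mixed_types.txt", delimiter("|") stringcols(_all) clear
destring, replace

* v3 is a hypothetical name for the mixed column; replace it with the real one
describe v3

* Assuming v3 is still a string, count the nonempty values that cannot be
* read as numbers (real() returns missing for nonnumeric strings)
count if missing(real(v3)) & !missing(v3)
```

If the count is above zero and describe reports a str type, the string values were kept rather than turned into missing.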

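My question is, why? The help file for import delimited states, "import delimited will check if the file is delimited by tabs or commas based on the first line of data." Is a similar rule being followed to assign var types?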

Answer 1

This is not the intended behavior.

From http://www.stata.com/help.cgi?whatsnew one reads:

import delimited has the following fixes:

a. import delimited, when string data were not present until row number 5,000 or higher for a variable in the imported text file, incorrectly chose a numeric data type instead of a string data type for that variable. This has been fixed.

You need to update. See help update.

(The same information can be accessed by running help whatsnew. The update is for Stata 14.)
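
As a rough sketch, the update could be checked and applied from the Stata command line as follows. It assumes the installation has internet access and permission to write to the Stata directory.

```stata
* Show the installed version and revision date
about

* Ask whether official updates are available for this installation
update query

* Download and install all available official updates
update all

* Afterwards, read the release notes, including the import delimited fix
help whatsnew
```

After updating, rerunning the original import delimited command without stringcols(_all) is a simple way to see whether the variable now comes in as a string.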


stata