Stata: importing .txt with inconsistent delimiters
I have a .txt file with rather odd delimiters. The data look like this:
|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,| |,|FOURTH QUARTER REPORT|,||
|ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac |,|MID-YEAR REPORT|,||
|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|
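So some variables never have pipes, like the sixth variable here, which is just an amount without pipes. Others lack pipes only when they are empty, like the fourth variable here, which appears as either ,, or ,|y|, . Some values also contain commas, so I can't simply use the comma as the delimiter. So there are basically two problems:
- The delimiter is a comma, but commas also appear inside string values
- Some variables are enclosed in pipes, some are not, and some are enclosed in pipes only when they are not empty
I'm looking for a way to handle this in Stata. Does anyone know how?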
If the whole dataset is messier than this example, I really don't want to know. But this seems to make some sense of it.
* Example generated by -dataex-. To install: ssc install dataex
clear
input str100 whatever
"|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,| |,|FOURTH QUARTER REPORT|,||"
"|ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac |,|MID-YEAR REPORT|,||"
"|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|"
end
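* Work on a copy; give empty fields (,,) a pair of pipes so every piped field ends in "|,"
gen work = whatever
replace work = subinstr(work, ",,", ",||,", .)
* Fields 1-5 are piped: peel off everything up to and including the first "|,"
forval j = 1/5 {
gen work`j' = substr(work, 1, strpos(work, "|,") + 1)
replace work = subinstr(work, work`j', "", 1)
}
* Field 6 is the unpiped amount: it ends at the first comma
gen work6 = substr(work, 1, strpos(work, ","))
replace work = subinstr(work, work6, "", 1)
* Fields 7-8 are piped again
forval j = 7/8 {
gen work`j' = substr(work, 1, strpos(work, "|,") + 1)
replace work = subinstr(work, work`j', "", 1)
}
* Whatever is left is field 9
gen work9 = work
drop work
* Clean up: strip the pipes, trim blanks, and drop the trailing comma
forval j = 1/9 {
replace work`j' = trim(subinstr(work`j', "|", "", .))
replace work`j' = substr(work`j', 1, length(work`j') - 1) if substr(work`j', -1, 1) == ","
}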
list
+-----------------------------------------------------------------------------------+
1. | whatever |
| |ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,| |,|FOURTH QUARTER REPORT|,|| |
|-----------------------------------------------------------------------------------|
| work1 | work2 | work3 | work4 | work5 | work6 | work7 |
| ABC4 | Name1 | NameRaw1 | y | XY1 | 10000.0 | |
|-----------------------------------------------------------------------------------|
| work8 | work9 |
| FOURTH QUARTER REPORT | |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
2. | whatever |
| |ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac |,|MID-YEAR REPORT|,|| |
|-----------------------------------------------------------------------------------|
| work1 | work2 | work3 | work4 | work5 | work6 | work7 |
| ABC5 | Name2, extraname | NameRaw2 | | XY2 | 266539.0 | pac |
|-----------------------------------------------------------------------------------|
| work8 | work9 |
| MID-YEAR REPORT | |
+-----------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------+
3. | whatever |
| |ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ| |
|-----------------------------------------------------------------------------------|
| work1 | work2 | work3 | work4 | work5 | work6 | work7 |
| ABC6 | Name3 | NameRaw3 | y | X,Y3 | 60000.0 | name |
|-----------------------------------------------------------------------------------|
| work8 | work9 |
| YEAR-END REPORT | XYZ |
+-----------------------------------------------------------------------------------+
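The code above starts from a string variable that is already in memory. As a rough end-to-end sketch, the same idea could start from a file: read each line whole, turn the field boundaries into a single sentinel character, then use split. Several things here are assumptions and not part of the answer above: the temporary file stands in for the real .txt, the width of 200 columns is a guess at the longest line, the sentinel "@" is assumed never to occur in the data, and the field* names are just illustrative. Like the code above, it also assumes that "|," and ",|" never occur inside a value.

```stata
* Sketch only. The temporary file stands in for the real .txt file.
clear
tempfile raw
tempname fh
file open `fh' using "`raw'", write text replace
file write `fh' "|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,| |,|FOURTH QUARTER REPORT|,||" _n
file write `fh' "|ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac |,|MID-YEAR REPORT|,||" _n
file write `fh' "|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|" _n
file close `fh'

* Read every line as one string variable (200 is an assumed maximum width).
infix str whatever 1-200 using "`raw'"

* Give empty unpiped fields a pair of pipes, as in the answer above.
gen work = subinstr(whatever, ",,", ",||,", .)
* "|," closes a piped field; ",|" closes the unpiped amount.
* Both become the sentinel "@" (assumed absent from the data).
replace work = subinstr(work, "|,", "@", .)
replace work = subinstr(work, ",|", "@", .)
* Any pipes still left are just wrappers around values.
replace work = subinstr(work, "|", "", .)

* One new variable per field; commas inside values are untouched.
split work, parse("@") generate(field)
drop work

* Trim stray blanks, then convert the amount to numeric.
foreach v of varlist field* {
    replace `v' = trim(`v')
}
destring field6, replace

list field*
```

With real data, the file write lines would go and infix would point at the actual file. It would be worth checking afterwards that split created the expected number of variables, since any line that breaks the assumed pattern would show up as extra or shifted fields.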