Reading an oddly formatted CSV file
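I'm looking at downloading some data from the statistics.gov.scot website. For example, I would like to get some data on hospital admission rates. The query to obtain the data table I'm interested in takes the form: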
http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall
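and is accessible via this link for anyone who wants to try it. The query generates a *.CSV file with the relevant information; however, the file's format presents some challenges.
File sample
The file content looks like this: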
Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"
,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"
When imported into Excel: [screenshot]
However, when imported into R via read.csv, it looks like this:
> head(problematicFile)
V1 V2
1 Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions Hospital Admissions
3 measure type
4 Admission Type
5 Age
6 Gender
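Problem
The read.csv import returns only two columns. I guess the problem is related to some of the initial columns being empty. I would like to read this file in a manner similar to the import achieved in Excel. The point is that I intend to use the values from row 7 in columns A and B and, naturally, the data table below. As for the resulting data.frame, I'm happy to have NA values wherever cells are empty, as long as its dimensions match those in Excel. I tried: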
read.csv(file = link, header = FALSE, na.strings = "",
fill = TRUE)
But I keep running into the same problem.
Desired results
The desired result should look like this (manually generated extract):
Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00 NA NA NA NA NA NA NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA NA NA NA NA NA NA
measure type NA NA NA NA NA NA NA NA
Admission Type NA NA NA NA NA NA NA NA
Age NA NA NA NA NA NA NA NA
Gender NA NA NA NA NA NA NA NA
Measure (cell values): Ratio (Rate Per 100,000 Population) NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA http://reference.data.gov.uk/id/year/2002 http://reference.data.gov.uk/id/year/2003 http://reference.data.gov.uk/id/year/2004 http://reference.data.gov.uk/id/year/2005 http://reference.data.gov.uk/id/year/2006 http://reference.data.gov.uk/id/year/2007 http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area 2002 2003 2004 2005 2006 2007 2008
http://statistics.gov.scot/id/statistical-geography/S92000003 Scotland 9,351 9,262 9,261 9,347 9,723 10,517 10,293
http://statistics.gov.scot/id/statistical-geography/S16000082 Angus South 8,236 8,500 8,523 8,371 8,616 8,978 9,325
http://statistics.gov.scot/id/statistical-geography/S16000106 Edinburgh Northern and Leith 9,040 8,040 7,925 9,042 10,355 11,833 8,916
http://statistics.gov.scot/id/statistical-geography/S16000140 Renfrewshire South 9,391 9,122 9,491 9,586 10,425 10,900 11,065
http://statistics.gov.scot/id/statistical-geography/S16000108 Edinburgh Southern 5,878 5,910 6,101 6,035 7,426 9,343 6,766
http://statistics.gov.scot/id/statistical-geography/S16000075 Aberdeen Donside 10,047 10,963 10,629 10,512 10,383 10,787 10,685
http://statistics.gov.scot/id/statistical-geography/S16000137 Perthshire North 9,388 9,524 7,799 9,350 9,543 9,791 9,991
http://statistics.gov.scot/id/statistical-geography/S16000077 Aberdeenshire East 7,211 7,300 7,153 7,411 7,435 7,268 7,547
http://statistics.gov.scot/id/statistical-geography/S16000114 Galloway and West Dumfries 9,861 9,165 8,143 9,258 7,508 10,213 10,399
http://statistics.gov.scot/id/statistical-geography/S16000096 Dumbarton 8,703 8,570 8,727 9,310 9,389 9,885 10,237
Screenshot
To illustrate further, I would like to maintain the dimensions and fill the missing values with NAs: [screenshot]
You need to specify col.names manually to force read.csv to read more columns. Also specifying na.strings as the empty string will put NA values in the empty cells.
read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="")
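As far as I can tell from the read.table documentation, the column count is taken from the first five lines of input unless col.names is longer, which would explain the two-column result. Here is a small sketch of that behaviour. The sample_lines vector is a shortened stand-in for the downloaded file, and the column names Col1 to Col4 are made up for the illustration.

```r
# Shortened stand-in for the downloaded file: seven 2-field lines, then a 4-field line
sample_lines <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'measure type,""',
  'Admission Type,""',
  'Age,""',
  'Gender,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# Without col.names the width is guessed from the first five lines
narrow <- read.csv(text = sample_lines, header = FALSE,
                   na.strings = "", fill = TRUE)
ncol(narrow)

# Supplying four (hypothetical) column names forces four columns
wide <- read.csv(text = sample_lines, header = FALSE,
                 na.strings = "", fill = TRUE,
                 col.names = paste0("Col", 1:4))
ncol(wide)
```

With the real file you would pass file = link instead of text = sample_lines and supply as many names as the widest row needs.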
You can use read.table and supply column names to specify the number of columns:
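read.table(file = link,
fill = TRUE,
sep = ",",
quote = "\"",        # only double quotes delimit fields
comment.char = "",   # keep the "#" in URLs such as dimension#refArea
na.strings = "",
col.names = paste("c", 1:12, sep = ""))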
However, I don't know whether this is a good solution, since you need to know the number of columns in advance.
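One possible way around knowing the width in advance is to let count.fields measure every line first and build col.names from the maximum. This is only a sketch: sample_lines is a shortened stand-in for the real file, and the c1, c2, ... names are arbitrary.

```r
# Shortened stand-in for the downloaded file (values copied from the sample above)
sample_lines <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'measure type,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# Count the fields on every line; comment.char = "" stops "#" from truncating a line
n_fields <- count.fields(textConnection(sample_lines), sep = ",",
                         quote = "\"", comment.char = "")
n_cols <- max(n_fields)

# Read with exactly as many columns as the widest line
wide <- read.table(text = sample_lines, sep = ",", quote = "\"",
                   comment.char = "", fill = TRUE, na.strings = "",
                   col.names = paste0("c", seq_len(n_cols)),
                   stringsAsFactors = FALSE)
dim(wide)
```

For the real download, count.fields(link, ...) and read.table(file = link, ...) should work the same way, at the cost of fetching the file twice unless you save it locally first.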
Another approach is to read the whole csv as strings. You can then pre-process it by storing the header in another object (e.g. a list) and use only the "table part" as the data frame.
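A rough sketch of that idea follows. It assumes the table starts at the row whose second field is "Reference Area" and that the row of year URIs sits directly above it, as in the sample in the question. raw_lines stands in for readLines(link), and the variable names are mine.

```r
# Stand-in for raw_lines <- readLines(link); values copied from the question
raw_lines <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'measure type,""',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"',
  'http://statistics.gov.scot/id/statistical-geography/S16000082,Angus South,"8,236","8,500"'
)

# Assumption: the table header is the first line containing ",Reference Area,"
header_row <- grep(",Reference Area,", raw_lines, fixed = TRUE)[1]

# Metadata: the lines above the year-URI row, split into fields and kept as a list
meta <- lapply(raw_lines[seq_len(header_row - 2)], function(x)
  scan(text = x, what = character(), sep = ",", quote = "\"", quiet = TRUE))

# Table part: from the header row to the end of the file
tab <- read.csv(text = raw_lines[header_row:length(raw_lines)],
                check.names = FALSE, stringsAsFactors = FALSE)

# The values carry thousands separators, so strip the commas before converting
year_cols <- names(tab)[-(1:2)]
tab[year_cols] <- lapply(tab[year_cols], function(x) as.numeric(gsub(",", "", x)))
str(tab)
```

The metadata stays available in meta, while tab is an ordinary data frame with numeric year columns.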
Parsing the metadata out of the headers is a bit tricky. You may prefer to download the entire normalized dataset rather than the cross-tabulated slice.
> reconv <- read.csv("http://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions")
> head(reconv)
GeographyCode DateCode Measurement Units Value Gender Age
1 S92000003 2003 Mean Average reconvictions per offender 0.62 All All
2 S92000003 2004 Mean Average reconvictions per offender 0.33 All All
3 S92000003 2004 Mean Average reconvictions per offender 0.61 All All
4 S92000003 2005 Mean Average reconvictions per offender 0.60 All All
5 S92000003 2006 Mean Average reconvictions per offender 0.60 All All
6 S92000003 2007 Mean Average reconvictions per offender 0.11 All All
This puts all the metadata into factor levels (so you don't have to parse it):
> str(reconv)
'data.frame': 10119 obs. of 7 variables:
$ GeographyCode: Factor w/ 26 levels "S12000005","S12000006",..: 26 26 26 26 26 26 26 26 26 26 ...
$ DateCode : int 2003 2004 2004 2005 2006 2007 2007 2008 2008 2009 ...
$ Measurement : Factor w/ 2 levels "Mean","Ratio": 1 1 1 1 1 1 1 1 1 1 ...
$ Units : Factor w/ 2 levels "Average reconvictions per offender",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Value : num 0.62 0.33 0.61 0.6 0.6 0.11 0.57 0.6 0.33 0.33 ...
$ Gender : Factor w/ 3 levels "All","Female",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Age : Factor w/ 6 levels "21-25","26-30",..: 4 4 4 4 4 4 4 4 4 4 ...
You can select the slice you are interested in:
> slice <- subset(reconv, Measurement=="Ratio" & Gender=="All" & Age=="All")
If needed, you can get back to the original cross-tabulated slice:
> library(reshape2)
> dcast(slice, GeographyCode ~ DateCode, value.var="Value", fun.aggregate = function(x) x[1])
GeographyCode 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
1 S12000005 41.4 34.3 41.0 40.7 37.4 37.2 33.3 34.6 35.8 33.0 32.8
2 S12000006 34.9 36.0 31.9 34.2 31.1 28.7 27.9 29.6 27.5 26.8 27.0
3 S12000008 33.7 33.2 33.7 33.2 31.7 32.8 30.4 31.5 29.1 28.1 28.7
4 S12000010 26.7 24.5 25.7 26.9 26.7 27.8 29.3 25.1 22.4 29.0 28.2
5 S12000013 31.7 26.1 30.6 35.4 31.6 25.9 24.0 18.9 30.5 22.8 18.6
...
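The aggregate function matters because some GeographyCode/DateCode pairs occur more than once (2004 appears twice in the head(reconv) output above). If you would rather not depend on reshape2, a base R sketch of the same "take the first value" cross-tabulation could look like this. The toy data frame only repeats the six rows printed above, and tapply returns a matrix rather than a data frame.

```r
# Rows copied from the head(reconv) output above; 2004 appears twice
toy <- data.frame(
  GeographyCode = "S92000003",
  DateCode = c(2003, 2004, 2004, 2005, 2006, 2007),
  Value = c(0.62, 0.33, 0.61, 0.60, 0.60, 0.11),
  stringsAsFactors = FALSE
)

# One row per geography, one column per year, keeping the first value of each cell
wide <- tapply(toy$Value,
               list(toy$GeographyCode, toy$DateCode),
               function(x) x[1])
wide
```

Swapping function(x) x[1] for mean or max would change how the duplicates are resolved, so it is worth checking why they exist before picking one.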