Opening .bcp files in R
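I have been trying to convert UK Charity Commission data, which is distributed in .bcp file format, into .csv files that can then be read into R. The data I am referring to is available here: http://data.charitycommission.gov.uk/. What I want is to turn these .bcp files into usable data frames that I can clean and analyse in R.
There are suggestions on how to do this with Python on this GitHub page, https://github.com/ncvo/charity-commission-extract, but unfortunately I have not been able to get those options to work.
Is there any syntax or package that would let me open these data directly in R? I have not found one.
Another option is to simply open the files in R as a single character vector using readLines. I have done this, and the files use @**@ as the column delimiter and *@@* as the row delimiter (see here: http://data.charitycommission.gov.uk/data-definition.aspx). Is there an R command that would let me create a data frame from one long string, specifying the delimiters for rows and columns?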
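R solution
Edited version
I am not sure whether all the .bcp files share the same format. I downloaded the dataset you mentioned and tried a solution on the smallest file, extract_aoo_ref.bcp: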
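library(data.table)
# read the whole file as a single string, exactly as stored on disk
text <- readChar("./extract_aoo_ref.bcp",
                 nchars = file.info("./extract_aoo_ref.bcp")$size,
                 useBytes = TRUE)
# semicolons already in the data would clash with the new column separator,
# so turn them into colons first
text <- gsub(";", ":", text)
# replace the column separator @**@ with ; and the row separator *@@* with a newline
# (the asterisks are regex metacharacters, so they need a double backslash in R strings)
text <- gsub("@\\*\\*@", ";", text)
text <- gsub("\\*@@\\*", "\n", text, perl = TRUE)
# parse the converted string
result <- data.table::fread(text,
                            header = FALSE,
                            sep = ";",
                            fill = TRUE,
                            quote = "",
                            strip.white = TRUE)
head(result, 10)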
#     V1 V2 V3                           V4                                           V5 V6
#  1:  A  1 THROUGHOUT ENGLAND AND WALES At least 10 authorities in England and Wales  N NA
#  2:  B  1 BRACKNELL FOREST             BRACKNELL FOREST                              N NA
#  3:  D  1 AFGHANISTAN                  AFGHANISTAN                                   N  2
#  4:  E  1 AFRICA                       AFRICA                                        N NA
#  5:  A  2 THROUGHOUT ENGLAND           At least 10 authorities in England only       N NA
#  6:  B  2 WEST BERKSHIRE               WEST BERKSHIRE                                N NA
#  7:  D  2 ALBANIA                      ALBANIA                                       N  3
#  8:  E  2 ASIA                         ASIA                                          N NA
#  9:  A  3 THROUGHOUT WALES             At least 10 authorities in Wales only         Y NA
# 10:  B  3 READING                      READING                                       N NA
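Since fill = TRUE pads short rows with NA without complaint, it may be worth counting the fields in each record before trusting the result. Below is a minimal base R sketch of such a check, assuming the @**@ and *@@* separators described in the question. The sample string is made up for illustration and is not taken from the real extract.

```r
# Made-up sample in the .bcp layout; the third record is deliberately too short
sample_text <- "A@**@1@**@X@**@N@**@*@@*D@**@1@**@Y@**@N@**@2*@@*E@**@1*@@*"

# split into records on the literal row separator (fixed = TRUE: no regex escaping)
records <- strsplit(sample_text, "*@@*", fixed = TRUE)[[1]]

# count the literal column separators in each record; fields = separators + 1
n_sep <- lengths(regmatches(records, gregexpr("@**@", records, fixed = TRUE)))
n_fields <- n_sep + 1L

# distribution of field counts across records
print(table(n_fields))

# indices of records whose field count differs from the most common one
expected <- as.integer(names(which.max(table(n_fields))))
print(which(n_fields != expected))
```

If every record reports the same field count, the NA values in the parsed table most likely come from genuinely empty fields rather than from broken rows.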
The same approach also works for the tricky file, extract_charity.bcp:
head(result[,1:3],10)
# V1 V2 V3
# 1: 200000 0 HOMEBOUND CRAFTSMEN TRUST
# 2: 200001 0 PAINTERS' COMPANY CHARITY
# 3: 200002 0 THE ROYAL OPERA HOUSE BENEVOLENT FUND
# 4: 200003 0 HERGA WORLD DISTRESS FUND
# 5: 200004 0 THE WILLIAM GOLDSTEIN LAY STAFF BENEVOLENT FUND (ROYAL HOSPITAL OF ST BARTHOLOMEW)
# 6: 200005 0 DEVON AND CORNWALL ROMAN CATHOLIC DEVELOPMENT SOCIETY
# 7: 200006 0 THE HORLEY SICK CHILDREN'S FUND
# 8: 200007 0 THE HOLDENHURST OLD PEOPLE'S HOME TRUST
# 9: 200008 0 LORNA GASCOIGNE TRUST FUND
# 10: 200009 0 THE RALPH LEVY CHARITABLE COMPANY LIMITED
So.. it looks like it is working :)
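For anyone who would rather avoid the regex escaping and the semicolon swap, the same idea can be sketched in base R with strsplit(fixed = TRUE), which treats the separators as literal text. This is only a sketch under the assumption that the separators are @**@ and *@@* as stated in the question. parse_bcp is a hypothetical helper name, not a function from any package, and the sample string is invented for illustration.

```r
# Hypothetical helper: turn a .bcp-style string into a data frame (base R only)
parse_bcp <- function(text, col_sep = "@**@", row_sep = "*@@*") {
  # split into records on the literal row separator
  rows <- strsplit(text, row_sep, fixed = TRUE)[[1]]
  # drop blank records, e.g. whitespace left after the final row separator
  rows <- rows[nzchar(trimws(rows))]
  # split each record into fields on the literal column separator
  fields <- strsplit(rows, col_sep, fixed = TRUE)
  # strsplit drops a trailing empty field, so pad short rows with NA
  n_col <- max(lengths(fields))
  fields <- lapply(fields, function(x) { length(x) <- n_col; x })
  as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
}

# Invented sample: the first record ends with an empty last field
sample_text <- paste0(
  "A@**@1@**@THROUGHOUT ENGLAND AND WALES@**@N@**@*@@*",
  "D@**@1@**@AFGHANISTAN@**@N@**@2*@@*"
)
df <- parse_bcp(sample_text)
print(df)
```

All columns come back as character, so numeric columns still need converting. Once the data frame looks right, write.csv(df, "extract_aoo_ref.csv", row.names = FALSE) would produce the .csv file asked about in the question. On the larger extracts this is likely to be slower than the fread approach above.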