Skipping rows starting with specific values while importing a CSV file into R using fread
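I am trying to import a CSV file from a URL in R. The file contains lines, at random positions, that begin with one of these strings: '<<<<<<< HEAD', '=======' or '>>>>>>> master'. I want to skip those lines and import the rest of the document. Is there a way to do this? I would prefer to use fread to import the data. Thanks for any input.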
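By default the data does not load. fread stops at the first instance of the strings above (line 347 of the CSV). The URL I am trying to download the data from is "https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv" and the message it reports is as follows: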
[0%] Downloaded 0 bytes...
Warning message:
In data.table::fread("https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv", :
Stopped early on line 347. Expected 7 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<<<<<<<< HEAD>>
The statement I am using to download the data is:
covid_ds <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv')
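You can read the data using read.csv with fill = TRUE, then keep only the rows whose date column holds a date-formatted value, so that lines such as '<<<<<<< HEAD' or '=======' are dropped. After that, use type_convert to convert the columns to their respective types.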
data <- read.csv('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', fill = TRUE)
data <- data[grepl('\\d+-\\d+-\\d+', data$date), ]
data <- readr::type_convert(data)
data
# date province country lat long type cases
# <date> <chr> <chr> <dbl> <dbl> <chr> <int>
# 1 2020-01-22 NA Afghanistan 33.9 67.7 confirmed 0
# 2 2020-01-23 NA Afghanistan 33.9 67.7 confirmed 0
# 3 2020-01-24 NA Afghanistan 33.9 67.7 confirmed 0
# 4 2020-01-25 NA Afghanistan 33.9 67.7 confirmed 0
# 5 2020-01-26 NA Afghanistan 33.9 67.7 confirmed 0
# 6 2020-01-27 NA Afghanistan 33.9 67.7 confirmed 0
# 7 2020-01-28 NA Afghanistan 33.9 67.7 confirmed 0
# 8 2020-01-29 NA Afghanistan 33.9 67.7 confirmed 0
# 9 2020-01-30 NA Afghanistan 33.9 67.7 confirmed 0
#10 2020-01-31 NA Afghanistan 33.9 67.7 confirmed 0
# … with 287,772 more rows
With data.table::fread, you can use blank.lines.skip=TRUE.
data <- data.table::fread('https://raw.githubusercontent.com/RamiKrispin/coronavirus/master/csv/coronavirus.csv', blank.lines.skip=TRUE)
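If you would rather remove the merge-conflict markers explicitly and still parse with fread, one option is to read the raw lines first, drop the marker lines, and pass the rest to fread through its text argument. The following is a minimal sketch, not a tested solution for the real file: the inline sample lines and the names raw_lines, is_marker and clean_dt are made up for illustration, and it assumes the markers always start at the beginning of a line.

```r
library(data.table)

# Hypothetical sample standing in for the downloaded file. For the real data
# you would use something like: raw_lines <- readLines(<the CSV URL>)
raw_lines <- c(
  "date,province,country,lat,long,type,cases",
  "2020-01-22,,Afghanistan,33.9,67.7,confirmed,0",
  "<<<<<<< HEAD",
  "2020-01-23,,Afghanistan,33.9,67.7,confirmed,0",
  "=======",
  "2020-01-24,,Afghanistan,33.9,67.7,confirmed,0",
  ">>>>>>> master"
)

# Flag lines that begin with seven '<', '=' or '>' characters,
# which is how Git writes its conflict markers.
is_marker <- grepl("^(<{7}|={7}|>{7})", raw_lines)

# Parse only the remaining lines; fread accepts a character vector via `text`.
clean_dt <- fread(text = raw_lines[!is_marker])
print(clean_dt)
```

This keeps fread's type detection for the real columns, at the cost of holding the raw lines in memory once before parsing.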