How to avoid timezone offset when using read_csv_arrow
Suppose I have a csv file, for example this one: https://www.misoenergy.org/planning/generator-interconnection/GI_Queue/gi-interactive-queue/#
If I do this
library(arrow)  # read_csv_arrow
library(dplyr)  # %>% and collect()
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-12 20:00:00 NA 2003-12-12 19:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-14 20:00:00 NA 2013-10-21 20:00:00 2015-12-31 19:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-07 19:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-11-30 19:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
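It seems to assume the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
I can run Sys.setenv(TZ="GMT") before loading the file, which avoids the offset problem.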
Sys.setenv(TZ="GMT")
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
miso_queue %>% collect()
# A tibble: 3,343 x 24
`Project #` `Request Status` `Queue Date` `Withdrawn Date` `Done Date` `Appl In Service ~` `Transmission ~` County State
<chr> <chr> <dttm> <dttm> <dttm> <dttm> <chr> <chr> <chr>
1 E002 Done 2013-09-13 00:00:00 NA 2003-12-13 00:00:00 NA Entergy Point~ LA
2 E291 Done 2012-05-15 00:00:00 NA 2013-10-22 00:00:00 2016-01-01 00:00:00 Entergy NA TX
3 G001 Withdrawn 1995-11-08 00:00:00 NA NA NA American Transm~ Brown~ WI
4 G002 Done 1998-12-01 00:00:00 NA NA NA LG&E and KU Ser~ Trimb~ KY
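Setting my session tz to GMT isn't too onerous, but I'm wondering whether there is a way to make it either assume the file is in my local time zone and leave the values as they are, or, if it wants to assume the file is in GMT, keep them in GMT regardless of my local time zone.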
It seems like it's assuming the file is in GMT and then converts the GMT representation of the date to my local time zone (Eastern).
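Actually, the time zone conversion you're seeing happens only when the values are printed. You can see this by saving the data frame to a variable and printing it before and after changing the current time zone: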
miso_queue <- read_csv_arrow("GI Interactive Queue.csv", as_data_frame = FALSE, timestamp_parsers = "%m/%d/%Y")
test <- miso_queue %>% collect()
Sys.setenv(TZ="US/Pacific")
test[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-12 17:00:00
# 2 2012-05-14 17:00:00
# 3 1995-11-07 16:00:00
# 4 1998-11-30 16:00:00
# 5 1998-11-30 16:00:00
# 6 1998-11-30 16:00:00
# 7 1999-02-14 16:00:00
# 8 1999-02-14 16:00:00
# 9 1999-07-29 17:00:00
# 10 1999-08-12 17:00:00
# # … with 3,333 more rows
Sys.setenv(TZ="GMT")
test[,"Queue Date"]
# # A tibble: 3,343 × 1
# `Queue Date`
# <dttm>
# 1 2013-09-13 00:00:00
# 2 2012-05-15 00:00:00
# 3 1995-11-08 00:00:00
# 4 1998-12-01 00:00:00
# 5 1998-12-01 00:00:00
# 6 1998-12-01 00:00:00
# 7 1999-02-15 00:00:00
# 8 1999-02-15 00:00:00
# 9 1999-07-30 00:00:00
# 10 1999-08-13 00:00:00
# # … with 3,333 more rows
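The same effect can be reproduced without the CSV file. The following is a minimal base R sketch, assuming the collected column is a POSIXct vector with no tzone attribute (which the printed output suggests, but which is an assumption here rather than something confirmed from the Arrow documentation). The epoch value is simply the first Queue Date from the output above, and the variable names are made up for illustration.

```r
# 2013-09-13 00:00:00 UTC as seconds since the epoch, wrapped as a POSIXct
# with no tzone attribute (assumed to mirror what collect() returns).
x <- .POSIXct(1379030400)

# The stored number never changes; only the rendering depends on the zone.
fmt <- "%Y-%m-%d %H:%M:%S"
eastern <- format(x, fmt, tz = "America/New_York")  # "2013-09-12 20:00:00"
gmt     <- format(x, fmt, tz = "UTC")               # "2013-09-13 00:00:00"

# Pinning a display zone on the vector itself makes formatting and printing
# use that zone instead of the session TZ.
attr(x, "tzone") <- "UTC"
pinned <- format(x, fmt)                            # "2013-09-13 00:00:00"
```

On the collected tibble, the same assignment could be applied per column, for example attr(test$`Queue Date`, "tzone") <- "UTC", so the values display in GMT without changing the session time zone.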
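However, the example you showed contains no time-of-day data, so you would be better off reading those columns as dates rather than timestamps. Unfortunately, I think that at the moment Arrow only lets you parse into dates if you supply a schema for the whole table. An alternative is to parse the date columns after reading.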
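Parsing after reading could look roughly like the sketch below. It uses base R only; the sample values are hypothetical stand-ins for a column of the collected table, not data taken from the real file. Two cases are shown, depending on whether the column arrived as a timestamp or as text.

```r
# Case 1: the column was read as a timestamp at midnight UTC.
# Taking the calendar date in UTC avoids any shift from the session TZ.
ts <- as.POSIXct(c("2013-09-13", "2012-05-15"), tz = "UTC")
date_from_ts <- as.Date(ts, tz = "UTC")

# Case 2: the column was left as text in the file's month/day/year layout.
txt <- c("09/13/2013", "05/15/2012")
date_from_txt <- as.Date(txt, format = "%m/%d/%Y")

# Both routes give the same Date values, which carry no time zone at all.
format(date_from_ts)   # "2013-09-13" "2012-05-15"
format(date_from_txt)  # "2013-09-13" "2012-05-15"
```

Because a Date has no time-of-day component, the result prints the same in every session time zone.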