R: Trouble converting string to proper date time format after exporting data from SQL

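I'll admit up front that I am new to R; my only other "programming" experience is in the MATLAB environment.

I have gone through a number of posts on Stack Overflow related to my problem, but have not found one that matches my exact issue, so I am posting here.

Problem definition

After exporting data (capturing information related to measurement devices) from SQL to a csv file, I import the data into R using read.csv (a wrapper around read.table) as follows: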

tbl = read.csv("myfile.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE);

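This gives a data frame with over 17,000 observations of 8 variables. Of these 8 variables, only the last 2 columns (ReadingTime and Reading) are of interest, so I reduce the data frame to df as follows: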

df = tbl[,c(7,8)];
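As a side note, the same subset can be taken by column name instead of position, which keeps working if the export ever changes its column order. This is only a minimal sketch: the inline CSV text and the DeviceId column are made up for illustration and are not part of the real export.

```r
# Hypothetical three-column export; only ReadingTime and Reading mirror the real data
csv_text <- paste(
  "DeviceId,ReadingTime,Reading",
  "1,2015-Dec-31 11:00:00 PM,3.52",
  "1,2015-Dec-31,5.44",
  sep = "\n"
)

# read.csv already defaults to sep = "," and header = TRUE
tbl <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Select the two columns of interest by name rather than by index
df <- tbl[, c("ReadingTime", "Reading")]
str(df)
```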

For illustration, the first 25 rows of df are shown below:

    df[c(1:25),]
               ReadingTime Reading
1  2015-Dec-31 11:00:00 PM    3.52
2  2015-Dec-31 10:00:00 PM    3.97
3   2015-Dec-31 9:00:00 PM    3.85
4   2015-Dec-31 8:00:00 PM    3.94
5   2015-Dec-31 7:00:00 PM    4.47
6   2015-Dec-31 6:00:00 PM    4.75
7   2015-Dec-31 5:00:00 PM    6.58
8   2015-Dec-31 4:00:00 PM    6.99
9   2015-Dec-31 3:00:00 PM    7.50
10  2015-Dec-31 2:00:00 PM    6.28
11  2015-Dec-31 1:00:00 PM    6.16
12 2015-Dec-31 12:00:00 PM    4.49
13 2015-Dec-31 11:00:00 AM    4.30
14 2015-Dec-31 10:00:00 AM    4.27
15  2015-Dec-31 9:00:00 AM    4.54
16  2015-Dec-31 8:00:00 AM    4.30
17  2015-Dec-31 7:00:00 AM    4.52
18  2015-Dec-31 6:00:00 AM    4.65
19  2015-Dec-31 5:00:00 AM    4.25
20  2015-Dec-31 4:00:00 AM    4.45
21  2015-Dec-31 3:00:00 AM    4.26
22  2015-Dec-31 2:00:00 AM    5.02
23  2015-Dec-31 1:00:00 AM    5.17
24             2015-Dec-31    5.44
25 2015-Dec-30 11:00:00 PM    5.53

Objective

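I now want to convert df into an xts object with a proper date-time format, so that I can create summary statistics and transform my data (e.g. convert the hourly time series into daily, weekly, etc. series), and eventually use the xts object in a forecasting exercise.

Difficulties encountered

When trying to convert ReadingTime in df (i.e. date-times stored as character strings) into a date-time format that xts recognizes, I run into a problem with readings that fall on midnight. For example: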

strptime(df[,1], "%Y-%b-%d %H:%M:%S %p",tz="GMT");
  df[c(1:25),1]
 [1] "2015-12-31 11:00:00 GMT" "2015-12-31 10:00:00 GMT" "2015-12-31 09:00:00 GMT"
 [4] "2015-12-31 08:00:00 GMT" "2015-12-31 07:00:00 GMT" "2015-12-31 06:00:00 GMT"
 [7] "2015-12-31 05:00:00 GMT" "2015-12-31 04:00:00 GMT" "2015-12-31 03:00:00 GMT"
[10] "2015-12-31 02:00:00 GMT" "2015-12-31 01:00:00 GMT" "2015-12-31 12:00:00 GMT"
[13] "2015-12-31 11:00:00 GMT" "2015-12-31 10:00:00 GMT" "2015-12-31 09:00:00 GMT"
[16] "2015-12-31 08:00:00 GMT" "2015-12-31 07:00:00 GMT" "2015-12-31 06:00:00 GMT"
[19] "2015-12-31 05:00:00 GMT" "2015-12-31 04:00:00 GMT" "2015-12-31 03:00:00 GMT"
[22] "2015-12-31 02:00:00 GMT" "2015-12-31 01:00:00 GMT" NA                       
[25] "2015-12-30 11:00:00 GMT"

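Questions:

My three questions are: 1) Why is AM/PM not preserved, and how can I fix it (this has already been resolved by @HubertL below)? 2) How do I overcome the NA at [24] and convert it to the proper format? 3) How do I convert df into an xts object?

Solution proposed by @HubertL

So far, @HubertL has resolved Q1. The first part of @HubertL's answer 2 (A2) splits ReadingTime into its components and adds another column to df, as follows: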

> df[c(1:25),]
               ReadingTime Reading                 dateSplit
1  2015-Dec-31 11:00:00 PM    3.52 2015-Dec-31, 11:00:00, PM
2  2015-Dec-31 10:00:00 PM    3.97 2015-Dec-31, 10:00:00, PM
3   2015-Dec-31 9:00:00 PM    3.85  2015-Dec-31, 9:00:00, PM
4   2015-Dec-31 8:00:00 PM    3.94  2015-Dec-31, 8:00:00, PM
5   2015-Dec-31 7:00:00 PM    4.47  2015-Dec-31, 7:00:00, PM
6   2015-Dec-31 6:00:00 PM    4.75  2015-Dec-31, 6:00:00, PM
7   2015-Dec-31 5:00:00 PM    6.58  2015-Dec-31, 5:00:00, PM
8   2015-Dec-31 4:00:00 PM    6.99  2015-Dec-31, 4:00:00, PM
9   2015-Dec-31 3:00:00 PM    7.50  2015-Dec-31, 3:00:00, PM
10  2015-Dec-31 2:00:00 PM    6.28  2015-Dec-31, 2:00:00, PM
11  2015-Dec-31 1:00:00 PM    6.16  2015-Dec-31, 1:00:00, PM
12 2015-Dec-31 12:00:00 PM    4.49 2015-Dec-31, 12:00:00, PM
13 2015-Dec-31 11:00:00 AM    4.30 2015-Dec-31, 11:00:00, AM
14 2015-Dec-31 10:00:00 AM    4.27 2015-Dec-31, 10:00:00, AM
15  2015-Dec-31 9:00:00 AM    4.54  2015-Dec-31, 9:00:00, AM
16  2015-Dec-31 8:00:00 AM    4.30  2015-Dec-31, 8:00:00, AM
17  2015-Dec-31 7:00:00 AM    4.52  2015-Dec-31, 7:00:00, AM
18  2015-Dec-31 6:00:00 AM    4.65  2015-Dec-31, 6:00:00, AM
19  2015-Dec-31 5:00:00 AM    4.25  2015-Dec-31, 5:00:00, AM
20  2015-Dec-31 4:00:00 AM    4.45  2015-Dec-31, 4:00:00, AM
21  2015-Dec-31 3:00:00 AM    4.26  2015-Dec-31, 3:00:00, AM
22  2015-Dec-31 2:00:00 AM    5.02  2015-Dec-31, 2:00:00, AM
23  2015-Dec-31 1:00:00 AM    5.17  2015-Dec-31, 1:00:00, AM
24             2015-Dec-31    5.44               2015-Dec-31
25 2015-Dec-30 11:00:00 PM    5.53 2015-Dec-30, 11:00:00, PM

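Now, when running the second line of code from A2, I ran into the problem that the suggested lengths function does not exist in my version of R (3.1.1), so I used the length function instead. Is that OK? Anyway, the result of running the second and third lines of A2 is as follows: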

> df[c(1:25),]
               ReadingTime Reading                 dateSplit                date
1  2015-Dec-31 11:00:00 PM    3.52 2015-Dec-31, 11:00:00, PM 2015-12-31 23:00:00
2  2015-Dec-31 10:00:00 PM    3.97 2015-Dec-31, 10:00:00, PM 2015-12-31 22:00:00
3   2015-Dec-31 9:00:00 PM    3.85  2015-Dec-31, 9:00:00, PM 2015-12-31 21:00:00
4   2015-Dec-31 8:00:00 PM    3.94  2015-Dec-31, 8:00:00, PM 2015-12-31 20:00:00
5   2015-Dec-31 7:00:00 PM    4.47  2015-Dec-31, 7:00:00, PM 2015-12-31 19:00:00
6   2015-Dec-31 6:00:00 PM    4.75  2015-Dec-31, 6:00:00, PM 2015-12-31 18:00:00
7   2015-Dec-31 5:00:00 PM    6.58  2015-Dec-31, 5:00:00, PM 2015-12-31 17:00:00
8   2015-Dec-31 4:00:00 PM    6.99  2015-Dec-31, 4:00:00, PM 2015-12-31 16:00:00
9   2015-Dec-31 3:00:00 PM    7.50  2015-Dec-31, 3:00:00, PM 2015-12-31 15:00:00
10  2015-Dec-31 2:00:00 PM    6.28  2015-Dec-31, 2:00:00, PM 2015-12-31 14:00:00
11  2015-Dec-31 1:00:00 PM    6.16  2015-Dec-31, 1:00:00, PM 2015-12-31 13:00:00
12 2015-Dec-31 12:00:00 PM    4.49 2015-Dec-31, 12:00:00, PM 2015-12-31 12:00:00
13 2015-Dec-31 11:00:00 AM    4.30 2015-Dec-31, 11:00:00, AM 2015-12-31 11:00:00
14 2015-Dec-31 10:00:00 AM    4.27 2015-Dec-31, 10:00:00, AM 2015-12-31 10:00:00
15  2015-Dec-31 9:00:00 AM    4.54  2015-Dec-31, 9:00:00, AM 2015-12-31 09:00:00
16  2015-Dec-31 8:00:00 AM    4.30  2015-Dec-31, 8:00:00, AM 2015-12-31 08:00:00
17  2015-Dec-31 7:00:00 AM    4.52  2015-Dec-31, 7:00:00, AM 2015-12-31 07:00:00
18  2015-Dec-31 6:00:00 AM    4.65  2015-Dec-31, 6:00:00, AM 2015-12-31 06:00:00
19  2015-Dec-31 5:00:00 AM    4.25  2015-Dec-31, 5:00:00, AM 2015-12-31 05:00:00
20  2015-Dec-31 4:00:00 AM    4.45  2015-Dec-31, 4:00:00, AM 2015-12-31 04:00:00
21  2015-Dec-31 3:00:00 AM    4.26  2015-Dec-31, 3:00:00, AM 2015-12-31 03:00:00
22  2015-Dec-31 2:00:00 AM    5.02  2015-Dec-31, 2:00:00, AM 2015-12-31 02:00:00
23  2015-Dec-31 1:00:00 AM    5.17  2015-Dec-31, 1:00:00, AM 2015-12-31 01:00:00
24             2015-Dec-31    5.44               2015-Dec-31                <NA>
25 2015-Dec-30 11:00:00 PM    5.53 2015-Dec-30, 11:00:00, PM 2015-12-30 23:00:00

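You can see that the NA at [24] is still there. When I apply the code suggested for A3, this causes all measurements taken at midnight to be dropped from the xts object, i.e.: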

> df[c(1:25),]
                    [,1]
2014-01-01 01:00:00 4.67
2014-01-01 02:00:00 4.78
2014-01-01 03:00:00 4.87
2014-01-01 04:00:00 4.61
2014-01-01 05:00:00 4.58
2014-01-01 06:00:00 4.47
2014-01-01 07:00:00 4.66
2014-01-01 08:00:00 4.46
2014-01-01 09:00:00 4.57
2014-01-01 10:00:00 4.87
2014-01-01 11:00:00 4.57
2014-01-01 12:00:00 4.67
2014-01-01 13:00:00 5.52
2014-01-01 14:00:00 6.42
2014-01-01 15:00:00 6.79
2014-01-01 16:00:00 6.50
2014-01-01 17:00:00 5.81
2014-01-01 18:00:00 5.65
2014-01-01 19:00:00 6.25
2014-01-01 20:00:00 5.79
2014-01-01 21:00:00 5.84
2014-01-01 22:00:00 6.06
2014-01-01 23:00:00 4.74
2014-01-02 01:00:00 4.66
2014-01-02 02:00:00 5.59

Any help resolving these last few issues would be greatly appreciated!

Answer 1: Use %I instead of %H

test = strptime(..., "%Y-%b-%d %I:%M:%S %p",tz="GMT")
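To see why this matters: the R documentation for strptime notes that %p is used in conjunction with %I and not with %H, so with %H the AM/PM marker has no effect on the hour. Below is a small sketch using three timestamps copied from the question. It assumes an English (or C) locale, since %b and %p are locale-dependent; the variable names are only illustrative.

```r
x <- c("2015-Dec-31 11:00:00 PM",
       "2015-Dec-31 9:00:00 AM",
       "2015-Dec-31 12:00:00 PM")

# %H reads the hour on a 24-hour clock, so "11 ... PM" stays at 11
with_H <- strptime(x, "%Y-%b-%d %H:%M:%S %p", tz = "GMT")

# %I reads the hour on a 12-hour clock and lets %p shift it
with_I <- strptime(x, "%Y-%b-%d %I:%M:%S %p", tz = "GMT")

format(with_H, "%H:%M")
format(with_I, "%H:%M")
```

With %I, the first timestamp should come out as 23:00 rather than 11:00, while the AM and noon entries are unchanged.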

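Answer 2:

# Split each timestamp on spaces: full entries give 3 parts, date-only (midnight) entries give 1
df$dateSplit <- strsplit( df$ReadingTime, " ")
# Rows with fewer than 3 parts are midnight readings; rewrite them as "<date> 12:00:00 AM"
midnight <- lengths(df$dateSplit) < 3
df[midnight,"ReadingTime"] <- 
    format(
        strptime(df$ReadingTime[midnight], "%Y-%b-%d", tz="GMT"),
        "%Y-%b-%d %I:%M:%S %p", tz="GMT")
# Parse every row with %I (12-hour clock) so that %p is honoured
df$date <- strptime(df$ReadingTime, "%Y-%b-%d %I:%M:%S %p", tz="GMT")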
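On the question of swapping lengths for length: the two are not interchangeable. length returns a single number (the number of list elements, here the number of rows), so a comparison such as length(x) < 3 never flags individual rows and the midnight entries are left untouched. lengths returns one count per element and was added in R 3.2.0. On an older R, a per-element count can be obtained with vapply. The sketch below uses a three-row stand-in for df and appends "12:00:00 AM" directly. That is a simplification of the approach above, not the answer's exact code, and it assumes an English (or C) locale.

```r
# Three-row stand-in for df, including one date-only (midnight) entry
df <- data.frame(
  ReadingTime = c("2015-Dec-31 1:00:00 AM", "2015-Dec-31", "2015-Dec-30 11:00:00 PM"),
  Reading     = c(5.17, 5.44, 5.53),
  stringsAsFactors = FALSE
)

parts <- strsplit(df$ReadingTime, " ")

length(parts)                               # one number: how many rows there are
n_parts <- vapply(parts, length, integer(1)) # one count per row; works before R 3.2.0

# Date-only rows have a single part; give them an explicit midnight time
midnight <- n_parts < 3
df$ReadingTime[midnight] <- paste(df$ReadingTime[midnight], "12:00:00 AM")

# Parse with %I so AM/PM is respected; store as POSIXct for use as a time index
df$date <- as.POSIXct(strptime(df$ReadingTime, "%Y-%b-%d %I:%M:%S %p", tz = "GMT"))
df[, c("ReadingTime", "date")]
```

After this step the date column should contain no NA, with the date-only row parsed as 00:00:00 on that day.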

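Answer 3:

library(xts)
# Index the readings by the parsed date-times (converted from POSIXlt to POSIXct)
xts(df$Reading, order.by = as.POSIXct(df$date))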
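Once the index has no NA values, the hourly series can be rolled up to other periods, which is the stated objective. The following is a rough sketch that assumes the xts package is installed; the three readings are taken from the question, with the midnight entry already repaired, and taking the mean is just one possible summary.

```r
library(xts)

# Timestamps in descending order, as in the export; xts sorts them on construction
idx <- as.POSIXct(
  c("2015-12-31 01:00:00", "2015-12-31 00:00:00", "2015-12-30 23:00:00"),
  tz = "GMT"
)
readings <- c(5.17, 5.44, 5.53)

x <- xts(readings, order.by = idx)

# Collapse the hourly observations to one mean value per calendar day
daily <- apply.daily(x, mean)
daily
```

apply.weekly and apply.monthly follow the same pattern for longer periods.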