Unpivoting data and filling null values to get current state from state changes in R
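I have searched online for a solution to this problem, but could not find an answer for this specific topic.
I have a data frame showing data for 19 different units, each with 2 departments. The departments can be in different states (states 1-5) and must always be in exactly one state.
The data looks like this: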
Time department fromState toState Date
46051.41923 unit36:depr2 4 5 2017-05-22 10:47
46077.33833 unit37:depr1 3 4 2017-05-22 10:47
47057.31889 unit31:depr2 2 3 2017-05-22 11:04
47062.31889 unit31:depr1 3 6 2017-05-22 11:04
The data shows each department's state before (fromState) and after (toState) a state change.
What I would rather have is:
Date unit36:depr2 unit37:depr1 unit31:depr2 unit31:depr1
2017-05-22 10:47 5 4 2 3
2017-05-22 11:04 5 4 3 6
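This way I can see the state of all units at any point in time. As you can see, I manually merged fromState and toState into columns named after the unit and department. I also removed the duplicate date values. The Time column was only used to build the Date column, so I dropped it as well.
Is there a way to do this non-manually?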
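I suggest looking at your data differently. Instead of "from" and "to", I think you should think in terms of an "initial state" and then record each change as it happens. Using your data (I added an "x" column here so that read.table(text=...) can cope with the space inside the date; use your own data instead):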
x <- read.table(text='Time department fromState toState Date x
46051.41923 unit36:depr2 4 5 2017-05-22 10:47
46077.33833 unit37:depr1 3 4 2017-05-22 10:47
47057.31889 unit31:depr2 2 3 2017-05-22 11:04
47062.31889 unit31:depr1 3 6 2017-05-22 11:04', header=TRUE, stringsAsFactors=FALSE)
x$Date <- as.POSIXct(paste(x$Date, x$x))
x$x <- NULL
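I'll use two libraries for this, since I think they fit well here and are easy to read. I'm sure someone can suggest data.table (probably faster) and base-R (no package dependencies) solutions.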
library(dplyr)
library(tidyr)
First, determine the starting state of every department. (The 0 date just stands for "before anything else happened".)
initial_state <- x %>%
arrange(Date) %>%
group_by(department) %>%
summarize(
date = as.POSIXct(0, origin='1970-01-01'),
state = fromState[1]
)
initial_state
# # A tibble: 4 × 3
# department date state
# <chr> <dttm> <int>
# 1 unit31:depr1 1970-01-01 3
# 2 unit31:depr2 1970-01-01 2
# 3 unit36:depr2 1970-01-01 4
# 4 unit37:depr1 1970-01-01 3
Then record each change at the moment it happens:
transitions <- select(x, department, date = Date, state = toState)
transitions
# department date state
# 1 unit36:depr2 2017-05-22 10:47:00 5
# 2 unit37:depr1 2017-05-22 10:47:00 4
# 3 unit31:depr2 2017-05-22 11:04:00 3
# 4 unit31:depr1 2017-05-22 11:04:00 6
The next step is to spread this into wide format, with one column per department:
bind_rows(initial_state, transitions) %>%
spread(department, state)
# # A tibble: 3 × 5
# date `unit31:depr1` `unit31:depr2` `unit36:depr2` `unit37:depr1`
# * <dttm> <int> <int> <int> <int>
# 1 1970-01-01 00:00:00 3 2 4 3
# 2 2017-05-22 10:47:00 NA NA 5 4
# 3 2017-05-22 11:04:00 6 3 NA NA
... realizing that NA means "nothing happened this day for this department, so carry forward from the previous non-NA row". Fortunately, the zoo package has a function that does exactly that:
na.locf package:zoo R Documentation
Last Observation Carried Forward
Description:
Generic function for replacing each 'NA' with the most recent
non-'NA' prior to it.
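To make the carry-forward idea concrete, here is a minimal base-R sketch of the same behaviour on a single vector. The helper name `locf` is hypothetical and only for illustration; it is not zoo's implementation, and unlike `zoo::na.locf` with its defaults it keeps leading NAs instead of dropping them.

```r
# Hypothetical helper: replace each NA with the most recent non-NA before it.
locf <- function(v) {
  idx <- cumsum(!is.na(v))        # how many non-NA values have been seen so far
  c(NA, v[!is.na(v)])[idx + 1]    # position 1 (NA) is used for leading NAs
}

# One department's column from the wide table above: 3, then no change, then 6.
locf(c(3L, NA, 6L))
```

Applied column by column to the wide table, this fills every NA with the state the department was already in.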
library(zoo) # for clarity, not strictly required since I use '::' here
bind_rows(initial_state, transitions) %>%
spread(department, state) %>%
mutate_all(zoo::na.locf) %>%
filter(date > 0) # since I no longer need the "0" date
# # A tibble: 2 × 5
# date `unit31:depr1` `unit31:depr2` `unit36:depr2` `unit37:depr1`
# <dttm> <int> <int> <int> <int>
# 1 2017-05-22 10:47:00 3 2 5 4
# 2 2017-05-22 11:04:00 6 3 5 4
Here's another strategy using tidyverse functions. First, your data:
library(tidyverse)
dd <- read_csv("Time,department,fromState,toState,Date
46051.41923,unit36:depr2,4,5,2017-05-22 10:47
46077.33833,unit37:depr1,3,4,2017-05-22 10:47
47057.31889,unit31:depr2,2,3,2017-05-22 11:04
47062.31889,unit31:depr1,3,6,2017-05-22 11:04")
Now I get the starting state of each department (its first fromState):
start <- dd %>%
group_by(department) %>%
summarize(state=first(fromState)) %>%
spread(department, state)
Now, for each date, I collect the states that changed on that date:
changes <- dd %>%
arrange(Date) %>%
select(Date, department, toState) %>%
split(.$Date) %>%
map(spread, department, toState)
Then I use accumulate to "replay" the changes at each date.
alt_list_modify <- function(x, y) list_modify(x, !!!y)
final <- accumulate(changes, alt_list_modify, .init = start) %>%
tail(-1) %>% bind_rows()
This returns the desired result:
# A tibble: 2 x 5
`unit31:depr1` `unit31:depr2` `unit36:depr2` `unit37:depr1` Date
<int> <int> <int> <int> <dttm>
1 3 2 5 4 2017-05-22 10:47:00
2 6 3 5 4 2017-05-22 11:04:00
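The same "replay" idea can also be sketched without any packages, using base R's `Reduce(..., accumulate = TRUE)`. This is only an illustrative sketch, under the assumption that Date is kept as a sortable character string. The helper `apply_changes` is a hypothetical name, and the sample data is retyped by hand rather than read from a file.

```r
# Same sample data as above; Date is kept as a sortable character string.
dd <- data.frame(
  department = c("unit36:depr2", "unit37:depr1", "unit31:depr2", "unit31:depr1"),
  fromState  = c(4L, 3L, 2L, 3L),
  toState    = c(5L, 4L, 3L, 6L),
  Date       = c("2017-05-22 10:47", "2017-05-22 10:47",
                 "2017-05-22 11:04", "2017-05-22 11:04"),
  stringsAsFactors = FALSE
)
dd <- dd[order(dd$Date), ]

# Starting state: the first fromState seen for each department.
first_rows <- !duplicated(dd$department)
start <- setNames(dd$fromState[first_rows], dd$department[first_rows])

# One named vector of new states per date.
changes <- lapply(split(dd, dd$Date),
                  function(d) setNames(d$toState, d$department))

# Hypothetical helper: overwrite only the departments that changed.
apply_changes <- function(state, change) {
  state[names(change)] <- change
  state
}

# Replay the changes in date order, keeping every intermediate state;
# the first element is the initial state itself, so drop it.
snapshots <- Reduce(apply_changes, changes, init = start, accumulate = TRUE)[-1]

# One row per date, one column per department.
final <- do.call(rbind, snapshots)
rownames(final) <- names(changes)
final
```

Here `final` is an integer matrix with the dates as row names. Wrapping it in `data.frame(Date = rownames(final), final, check.names = FALSE)` would give a data frame instead, if that is more convenient.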