Organizing a dataframe - splitting one column into three
I have a dataset that looks like this:
Ord_ID Supplier Trans_Type Date
1 A PO 2/3/18
1 A Receipt 2/15/18
2 B PO 2/4/18
2 B Receipt 3/13/18
3 C PO 2/7/18
3 C Receipt 3/1/18
3 C Receipt 3/5/18
3 C Receipt 3/29/18
4 B PO 2/9/18
4 B Receipt 2/20/18
4 B Receipt 2/27/18
5 D PO 2/18/18
5 D Receipt 4/2/18
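Basically, I need to split the Date column into 3 different columns: a PO_Date column, a column that lists the earliest receipt date for each order, and a column that lists the last receipt date for each order. Because some orders only have one receipt date, the two receipt date columns should be the same for those orders. I've tried using spread(), but I'm guessing it's not working because the number of receipt dates varies per order. How can I do this?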
Desired result:
Ord_ID Supplier PO_Date First_Receipt_Date Last_Receipt_Date
1 A 2/3/18 2/15/18 2/15/18
2 B 2/4/18 3/13/18 3/13/18
3 C 2/7/18 3/1/18 3/29/18
4 B 2/9/18 2/20/18 2/27/18
5 D 2/18/18 4/2/18 4/2/18
I would start with something like this:
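library(dplyr)

# Assumes Date is already of class Date, so min()/max() compare real dates
data %>%
  # Group by order as well, so suppliers with several orders are not pooled
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  summarise(min_date = min(Date),
            max_date = max(Date)) %>%
  ungroup()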
Then you can use gather and spread to get the columns you need.
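To make that last step concrete, here is a minimal sketch of the gather/spread reshaping. It assumes the dates are already of class Date and that the grouping includes Ord_ID. The data frame name `orders` and the intermediate column names `stat` and `key` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question; `orders` is a hypothetical name
orders <- tibble(
  Ord_ID     = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L),
  Supplier   = c("A", "A", "B", "B", "C", "C", "C", "C", "B", "B", "B", "D", "D"),
  Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt", "Receipt",
                 "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"),
  Date       = as.Date(c("2/3/18", "2/15/18", "2/4/18", "3/13/18", "2/7/18",
                         "3/1/18", "3/5/18", "3/29/18", "2/9/18", "2/20/18",
                         "2/27/18", "2/18/18", "4/2/18"), format = "%m/%d/%y")
)

wide <- orders %>%
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  summarise(min_date = min(Date), max_date = max(Date)) %>%
  ungroup() %>%
  # Long form: one row per order / transaction type / statistic
  gather(stat, Date, min_date, max_date) %>%
  # Build the future column names, e.g. "PO_min_date"
  unite(key, Trans_Type, stat) %>%
  # Wide form: one column per transaction type / statistic
  spread(key, Date) %>%
  # PO_max_date is dropped here; it equals PO_min_date when each order has one PO
  select(Ord_ID, Supplier,
         PO_Date            = PO_min_date,
         First_Receipt_Date = Receipt_min_date,
         Last_Receipt_Date  = Receipt_max_date)

wide
```

Orders with a single receipt get the same date in both receipt columns, because min() and max() of one value coincide.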
Using dplyr. First, make sure the Date column is in date format. Assuming the data frame is named mydata:
library(dplyr)

mydata <- mydata %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%y"))
Now you can filter for Receipt, compute the min/max dates, then filter the original data for PO and join the two together:
mydata %>%
  filter(Trans_Type == "Receipt") %>%
  group_by(Ord_ID, Supplier) %>%
  summarise(First_Receipt_Date = min(Date),
            Last_Receipt_Date = max(Date)) %>%
  ungroup() %>%
  # Join the PO rows back on by order and supplier
  left_join(filter(mydata, Trans_Type == "PO"), by = c("Ord_ID", "Supplier")) %>%
  select(Ord_ID, Supplier, PO_Date = Date, First_Receipt_Date, Last_Receipt_Date)
Result:
Ord_ID Supplier PO_Date First_Receipt_Date Last_Receipt_Date
<int> <chr> <date> <date> <date>
1 1 A 2018-02-03 2018-02-15 2018-02-15
2 2 B 2018-02-04 2018-03-13 2018-03-13
3 3 C 2018-02-07 2018-03-01 2018-03-29
4 4 B 2018-02-09 2018-02-20 2018-02-27
5 5 D 2018-02-18 2018-04-02 2018-04-02
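Here's another tidyverse-based solution that avoids the left_join. I don't know which approach would be faster on a large dataset, but it's always good to have more options: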
df <- structure(list(Ord_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L), Supplier = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 2L, 2L, 2L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
Trans_Type = c("PO", "Receipt", "PO", "Receipt", "PO", "Receipt",
"Receipt", "Receipt", "PO", "Receipt", "Receipt", "PO", "Receipt"
), Date = structure(c(17565, 17577, 17566, 17603, 17569,
17591, 17595, 17619, 17571, 17582, 17589, 17580, 17623), class = "Date")), row.names = c(NA,
-13L), class = "data.frame")
library(dplyr)
library(tidyr)

df %>%
  group_by(Ord_ID, Supplier, Trans_Type) %>%
  # Keep only min and max date values
  filter(Date == min(Date) | Date == max(Date) | Trans_Type != 'Receipt') %>%
  # Rename 2nd Receipt value Receipt_2 so there are no duplicated values
  mutate(Trans_Type2 = if_else(Trans_Type == 'Receipt' & row_number() == 2,
                               'Receipt_2', Trans_Type)) %>%
  # Drop Trans_Type variable (we can't replace in mutate since it's a grouping var)
  ungroup(Trans_Type) %>%
  select(-Trans_Type) %>%
  # Spread the now unduplicated Trans_Type values
  spread(Trans_Type2, Date) %>%
  # Fill in Receipt_2 values where they're missing
  mutate(Receipt_2 = if_else(is.na(Receipt_2), Receipt, Receipt_2))
# A tibble: 5 x 5
Ord_ID Supplier PO Receipt Receipt_2
<int> <fct> <date> <date> <date>
1 1 A 2018-02-03 2018-02-15 2018-02-15
2 2 B 2018-02-04 2018-03-13 2018-03-13
3 3 C 2018-02-07 2018-03-01 2018-03-29
4 4 B 2018-02-09 2018-02-20 2018-02-27
5 5 D 2018-02-18 2018-04-02 2018-04-02
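You can do this with just dplyr (plus lubridate's mdy() to parse the dates) by mutating new columns for the PO date and the first and last receipt dates: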
library(dplyr)
library(lubridate)  # for mdy()

test1 <- test %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Ord_ID) %>%
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  filter(!is.na(PO_Date)) %>%
  # an explicit origin keeps this working on older R versions
  mutate(PO_Date = as.Date(as.numeric(PO_Date), origin = "1970-01-01"))
Breakdown:
test1 <- test %>%
  # parse the "Date" column as a date so min and max dates can be identified
  mutate(Date = mdy(Date)) %>%
  # group by the Order ID
  group_by(Ord_ID) %>%
  # PO_Date is the date where "Trans_Type" is "PO" --> ifelse() drops the Date
  # class and returns the underlying numbers, but that can be fixed later
  mutate(PO_Date = ifelse(Trans_Type == "PO", Date, NA),
         # first receipt date is the minimum date of a receipt transaction
         Receipt_Date_First = min(Date[Trans_Type == "Receipt"]),
         # last receipt date is the maximum date of a receipt transaction
         Receipt_Date_Last = max(Date[Trans_Type == "Receipt"])) %>%
  # keep only the PO row of each order, which removes the duplicates
  filter(!is.na(PO_Date)) %>%
  # convert "PO_Date" back from numeric to Date
  # (an explicit origin keeps this working on older R versions)
  mutate(PO_Date = as.Date(as.numeric(PO_Date), origin = "1970-01-01"))
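The numeric round trip is needed because base ifelse() returns only the underlying day counts and drops the Date class. Below is a small sketch of the difference, using dplyr's if_else() as one possible alternative. The vectors `d` and `is_po` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy vectors: two dates and a flag marking the PO row
d     <- as.Date(c("2018-02-03", "2018-02-15"))
is_po <- c(TRUE, FALSE)

# Base ifelse() keeps only the underlying numbers (days since 1970-01-01)
po_numeric <- ifelse(is_po, d, NA)
class(po_numeric)

# dplyr::if_else() keeps the Date class when both branches are Dates,
# so the missing value has to be a Date NA as well
po_date <- if_else(is_po, d, as.Date(NA))
class(po_date)
```

With if_else() and as.Date(NA) in the mutate() call, the final as.Date() conversion would presumably no longer be needed.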
Using tidyverse, borrowing @divibisan's sample data:
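library(tidyverse)

df %>%
  group_by(Ord_ID, Supplier) %>%
  # Keep rows 1, 2 and the last one: PO, first receipt, last receipt
  # (relies on the row order within each order)
  slice(c(1:2, n())) %>%
  mutate(Trans_Type = c("PO_Date","First_Receipt_Date","Last_Receipt_Date")) %>%
  spread(Trans_Type, Date) %>%
  ungroup()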
# # A tibble: 5 x 5
# Ord_ID Supplier First_Receipt_Date Last_Receipt_Date PO_Date
# <int> <fct> <date> <date> <date>
# 1 1 A 2018-02-15 2018-02-15 2018-02-03
# 2 2 B 2018-03-13 2018-03-13 2018-02-04
# 3 3 C 2018-03-01 2018-03-29 2018-02-07
# 4 4 B 2018-02-20 2018-02-27 2018-02-09
# 5 5 D 2018-04-02 2018-04-02 2018-02-18
If the data isn't sorted like the sample data, add %>% arrange(Trans_Type, Date) as the first step.
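As a rough sketch of that variant, here is the same pipeline run on deliberately shuffled rows. The data frame `shuffled` is made up for illustration, and the approach still assumes every order has exactly one PO row and at least one receipt.

```r
library(dplyr)
library(tidyr)

# Hypothetical out-of-order rows for orders 1 and 3 from the question
shuffled <- tibble(
  Ord_ID     = c(3L, 1L, 3L, 3L, 1L, 3L),
  Supplier   = c("C", "A", "C", "C", "A", "C"),
  Trans_Type = c("Receipt", "Receipt", "PO", "Receipt", "PO", "Receipt"),
  Date       = as.Date(c("2018-03-29", "2018-02-15", "2018-02-07",
                         "2018-03-01", "2018-02-03", "2018-03-05"))
)

res <- shuffled %>%
  # "PO" sorts before "Receipt", then dates ascend within each type
  arrange(Trans_Type, Date) %>%
  group_by(Ord_ID, Supplier) %>%
  # Rows 1, 2 and the last one: PO, first receipt, last receipt
  slice(c(1:2, n())) %>%
  mutate(Trans_Type = c("PO_Date", "First_Receipt_Date", "Last_Receipt_Date")) %>%
  spread(Trans_Type, Date) %>%
  ungroup()

res
```

For order 1, which has a single receipt, slice() picks row 2 twice, so the first and last receipt dates come out identical.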