Retain attributes when using gather from tidyr (attributes are not identical)
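I have a data frame that needs to be split into two tables to satisfy Codd's third normal form. In a simple case, the original data frame looks something like this: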
library(lubridate)
> (df <- data.frame(hh_id = 1:2,
income = c(55000, 94000),
bday_01 = ymd(c(20150309, 19890211)),
bday_02 = ymd(c(19850911, 20000815)),
gender_01 = factor(c("M", "F")),
gender_02 = factor(c("F", "F"))))
hh_id income bday_01 bday_02 gender_01 gender_02
1 1 55000 2015-03-09 1985-09-11 M F
2 2 94000 1989-02-11 2000-08-15 F F
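When I use the gather function, it warns that the attributes are not identical and drops the factor for gender and the lubridate date class for bday (or other attributes in the real-world example). Is there a nice tidyr solution that avoids losing each column's data type?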
library(dplyr)  # provides select() and %>%
library(tidyr)
> (person <- df %>%
select(hh_id, bday_01:gender_02) %>%
gather(key, value, -hh_id) %>%
separate(key, c("key", "per_num"), sep = "_") %>%
spread(key, value))
hh_id per_num bday gender
1 1 01 1425859200 M
2 1 02 495244800 F
3 2 01 603158400 F
4 2 02 966297600 F
Warning message:
attributes are not identical across measure variables; they will be dropped
> lapply(person, class)
$hh_id
[1] "integer"
$per_num
[1] "character"
$bday
[1] "character"
$gender
[1] "character"
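I can imagine an approach that gathers each group of variables with the same data type separately and then joins all the tables, but there must be a more elegant solution that I'm missing.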
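To make that workaround concrete, here is a rough sketch of what it might look like. It assumes dplyr (1.0 or later, for across()) and tidyr are installed, rebuilds the example data with base as.Date() instead of ymd() so it stands alone, and uses the names bdays and genders purely for illustration.

```r
library(dplyr)
library(tidyr)

# Same shape as the example data; as.Date() stands in for lubridate::ymd()
df <- data.frame(hh_id = 1:2,
                 income = c(55000, 94000),
                 bday_01 = as.Date(c("2015-03-09", "1989-02-11")),
                 bday_02 = as.Date(c("1985-09-11", "2000-08-15")),
                 gender_01 = factor(c("M", "F")),
                 gender_02 = factor(c("F", "F")))

# Gather the date columns on their own; they all share the same attributes
bdays <- df %>%
  select(hh_id, starts_with("bday")) %>%
  gather(key, bday, -hh_id) %>%
  separate(key, c("key", "per_num"), sep = "_") %>%
  select(-key)

# Gather the gender columns separately, going through character so that
# the differing factor levels do not matter, then rebuild one factor
genders <- df %>%
  select(hh_id, starts_with("gender")) %>%
  mutate(across(starts_with("gender"), as.character)) %>%
  gather(key, gender, -hh_id) %>%
  separate(key, c("key", "per_num"), sep = "_") %>%
  select(-key) %>%
  mutate(gender = factor(gender))

# Join the two long tables back together on household and person number
person <- inner_join(bdays, genders, by = c("hh_id", "per_num"))
lapply(person, class)
```

It works, but it needs one gather per data type plus a join, which is why I am hoping for something tidier.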
You didn't seem to like my answer. Let me tempt you once more:
(df <- data.frame(hh_id = 1:2,
income = c(55000, 94000),
bday_01 = ymd(c(20150309, 19890211)),
bday_02 = ymd(c(19850911, 20000815)),
gender_01 = factor(c("M", "F")),
gender_02 = factor(c("F", "F"))))
reshape(df, idvar = 'hh_id', varying = list(3:4, 5:6), direction = 'long',
v.names = c('bday','gender'), timevar = 'per_num')
# hh_id income per_num bday gender
# 1.1 1 55000 1 2015-03-09 M
# 2.1 2 94000 1 1989-02-11 F
# 1.2 1 55000 2 1985-09-11 F
# 2.2 2 94000 2 2000-08-15 F
You could convert the dates to character and then convert them back to dates at the end:
(person <- df %>%
select(hh_id, bday_01:gender_02) %>%
mutate(across(contains('bday'), as.character)) %>%  # mutate_each()/funs() are deprecated
gather(key, value, -hh_id) %>%
separate(key, c("key", "per_num"), sep = "_") %>%
spread(key, value) %>%
mutate(bday=ymd(bday)))
hh_id per_num bday gender
1 1 01 2015-03-09 M
2 1 02 1985-09-11 F
3 2 01 1989-02-11 F
4 2 02 2000-08-15 F
Alternatively, if you use Date instead of POSIXct, you could do something like this:
(person <- df %>%
select(hh_id, bday_01:gender_02) %>%
gather(per_num1, gender, contains('gender'), convert=TRUE) %>%
gather(per_num2, bday, contains('bday'), convert=TRUE) %>%
mutate(bday=as.Date(bday)) %>%
# str_extract() comes from stringr; the backslash must be doubled in an R string
mutate(across(c(per_num1, per_num2), ~ stringr::str_extract(.x, '\\d+'))) %>%
filter(per_num1 == per_num2) %>%
rename(per_num=per_num1) %>%
select(-per_num2))
Edit
The warning you're seeing:
Warning: attributes are not identical across measure variables; they will be dropped
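arises from gathering the gender columns, which are factors with different level vectors (see str(df)). If you convert the gender columns to character, or sync their levels with something like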
df <- mutate(df, gender_02 = factor(gender_02, levels=levels(gender_01)))
then you'll see that the warning goes away when you run:
person <- df %>%
select(hh_id, bday_01:gender_02) %>%
gather(key, value, contains('gender'))
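To see the mismatch directly, here is a small base R sketch. It only compares the attributes of the two factors; it is my own illustration of the kind of difference the warning refers to, not a copy of what tidyr does internally.

```r
# The two gender columns from the example, built standalone
gender_01 <- factor(c("M", "F"))
gender_02 <- factor(c("F", "F"))

# factor() only keeps levels that occur in the data, so the sets differ
levels(gender_01)
levels(gender_02)

# Because the levels differ, the attributes are not identical
before <- identical(attributes(gender_01), attributes(gender_02))

# Sync the levels, as in the mutate() call above
gender_02 <- factor(gender_02, levels = levels(gender_01))
after <- identical(attributes(gender_01), attributes(gender_02))

c(before = before, after = after)
```

Once both columns carry the same levels, their attributes match and there is nothing for gather to drop.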
With tidyr 1.0.0, this can be done as follows:
suppressPackageStartupMessages({
library(tidyr)
library(lubridate)
})
df <- data.frame(hh_id = 1:2,
income = c(55000, 94000),
bday_01 = ymd(c(20150309, 19890211)),
bday_02 = ymd(c(19850911, 20000815)),
gender_01 = factor(c("M", "F")),
gender_02 = factor(c("F", "F")))
pivot_longer(df, -(1:2), names_to = c(".value", "per_num"), names_sep = "_")
#> # A tibble: 4 x 5
#> hh_id income per_num bday gender
#> <int> <dbl> <chr> <date> <fct>
#> 1 1 55000 01 2015-03-09 M
#> 2 1 55000 02 1985-09-11 F
#> 3 2 94000 01 1989-02-11 F
#> 4 2 94000 02 2000-08-15 F
Created on 2019-09-14 by the reprex package (v0.3.0)
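If you want to confirm that the column types survive, a quick check along these lines should do. This is a sketch that assumes tidyr 1.0.0 or later, builds the data with base as.Date() rather than ymd() so lubridate is not needed, and uses the name long only for illustration.

```r
library(tidyr)

# Same example data, with as.Date() standing in for lubridate::ymd()
df <- data.frame(hh_id = 1:2,
                 income = c(55000, 94000),
                 bday_01 = as.Date(c("2015-03-09", "1989-02-11")),
                 bday_02 = as.Date(c("1985-09-11", "2000-08-15")),
                 gender_01 = factor(c("M", "F")),
                 gender_02 = factor(c("F", "F")))

# ".value" tells pivot_longer to use the first part of each name
# (bday, gender) as an output column, and the second part as per_num
long <- pivot_longer(df, cols = -c(hh_id, income),
                     names_to = c(".value", "per_num"),
                     names_sep = "_")

# First class of each column; bday and gender keep their types
vapply(long, function(x) class(x)[1], character(1))
```

Because each output column is built only from source columns of a single type, nothing has to be coerced to a common type the way it is with a single gather over everything.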