Tidy data: gather multiple rows into columns using a pattern
My data frame is not tidy:
id 16
pol_pup1.irf_pol1_pub1 0.0186380741
pol_pup1.lower_pol1_pub1 0.0092071786
pol_pup1.upper_pol1_pub1 0.0289460145
pol_pup10.irf_pol10_pub10 0.0061496499
pol_pup10.lower_pol10_pub10 0.0030948510
pol_pup10.upper_pol10_pub10 0.0080107893
pol_pup105.irf_pol105_pub105 0.0377057491
pol_pup105.lower_pol105_pub105 0.0157756274
pol_pup105.upper_pol105_pub105 0.0610782151
pol_pup111.irf_pol111_pub111 0.0169799646
pol_pup111.lower_pol111_pub111 0.0111885580
pol_pup111.upper_pol111_pub111 0.0217701354
pol_pup112.irf_pol112_pub112 0.0156278416
pol_pup112.lower_pol112_pub112 -0.0043273923
pol_pup112.upper_pol112_pub112 0.0342078865
pol_pup113.irf_pol113_pub113 0.0280868673
pol_pup113.lower_pol113_pub113 0.0203300863
pol_pup113.upper_pol113_pub113 0.0366594965
pol_pup114.irf_pol114_pub114 0.0086282368
and so on with different numbers
How can I build a data frame in which 'IRF', 'lower' and 'upper' each have their own column, and each number in the 'id' column is a single observation, like this:
Observation IRF Lower Upper
1 0.018 0.009 0.028
10 0.006 0.003 0.008
105 0.037 0.015 0.061
111 0.016 0.011 0.021
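I'm not sure how consistent your data frame is, but some variation of the following should work. I'm assuming the numeric column is named "16".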
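library(dplyr)
library(tidyr)
library(stringr)

df %>%
  mutate(
    # First run of digits in id is the observation number
    obs = str_extract(id, '[0-9]+'),
    # Which of the three measures this row holds
    group = str_extract(id, 'irf|lower|upper')
  ) %>%
  select(-id) %>%
  pivot_wider(
    names_from = group,
    values_from = `16`
  )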
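To make the snippet above reproducible, here is a minimal sketch that builds a small `df` by hand from the first six rows of the question and runs the same pipeline. The way `df` is constructed, the name `tidy`, and the `as.integer()` conversion are assumptions added for illustration; the real data presumably has many more rows.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Assumed sample: the first two observations from the question
df <- tibble(
  id = c("pol_pup1.irf_pol1_pub1",
         "pol_pup1.lower_pol1_pub1",
         "pol_pup1.upper_pol1_pub1",
         "pol_pup10.irf_pol10_pub10",
         "pol_pup10.lower_pol10_pub10",
         "pol_pup10.upper_pol10_pub10"),
  `16` = c(0.0186380741, 0.0092071786, 0.0289460145,
           0.0061496499, 0.0030948510, 0.0080107893)
)

tidy <- df %>%
  mutate(
    # First run of digits in id, converted so it sorts numerically
    obs = as.integer(str_extract(id, "[0-9]+")),
    # One of the three measure names
    group = str_extract(id, "irf|lower|upper")
  ) %>%
  select(-id) %>%
  pivot_wider(names_from = group, values_from = `16`)

print(tidy)
```

The result should have one row per observation and the columns obs, irf, lower and upper. Dropping `id` before `pivot_wider` matters: otherwise every original row would keep its own unique id and nothing would collapse into a single row.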
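Here is an approach using separate from tidyr:
Once the first column has been split into several columns, we can use a regular expression with str_extract from stringr to pull out the values. The pattern "[a-z]+$" matches one or more lowercase letters followed by the end of the string.
We can then use pivot_wider from tidyr.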
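library(tidyr)
library(dplyr)
library(stringr)

data %>%
  # Split id on "_", e.g. "pol", "pup1.irf", "pol1", "pub1"
  separate(id, sep = "_", into = c("Pol", "Value", "Observation", "Pub")) %>%
  # Keep the trailing letters of Value and the trailing digits of Observation
  mutate(Value = str_extract(Value, "[a-z]+$"),
         Observation = str_extract(Observation, "[0-9]+$")) %>%
  dplyr::select(-Pol, -Pub) %>%
  # The last remaining column holds the numeric values
  pivot_wider(names_from = Value, values_from = last_col())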
# A tibble: 7 x 4
  Observation     irf    lower   upper
  <chr>         <dbl>    <dbl>   <dbl>
1 1           0.0186   0.00921 0.0289
2 10          0.00615  0.00309 0.00801
3 105         0.0377   0.0158  0.0611
4 111         0.0170   0.0112  0.0218
5 112         0.0156  -0.00433 0.0342
6 113         0.0281   0.0203  0.0367
7 114         0.00863 NA       NA
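To see why those two patterns pick out the right pieces, it can help to trace a single id by hand. The sketch below uses base R's `strsplit` as a stand-in for what `separate` does to one value; that substitution and the variable names are assumptions made for illustration only.

```r
library(stringr)

id <- "pol_pup1.irf_pol1_pub1"

# Split on the literal underscore, as separate(sep = "_") does per row
parts <- strsplit(id, "_", fixed = TRUE)[[1]]
print(parts)        # "pol" "pup1.irf" "pol1" "pub1"

# Second piece: keep only the lowercase letters at the end of the string
value <- str_extract(parts[2], "[a-z]+$")
print(value)        # "irf"

# Third piece: keep only the digits at the end of the string
observation <- str_extract(parts[3], "[0-9]+$")
print(observation)  # "1"
```

The `$` anchor is what keeps "pup1.irf" from matching "pup": only the run of letters that touches the end of the string qualifies. Note that Observation stays a character column (shown as <chr> in the output above), so wrap it in as.integer() if numeric sorting is needed.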