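Transform duplicate rows to columns
I am working with a database that contains hundreds of variables, but because it comes from JSON, it is hard to organize. For example, instead of putting the information in columns, the file creates new rows. See the example.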
# data_frame() is deprecated; tibble::tibble() is its replacement
df1 <- tibble::tibble(ID = c(111,111,111,111,111,111,222,222,333),
                      NAME = c('JOHN','JOHN','MARY','MARY','JAMES','JAMES','WILL','WILL','MARK'),
                      ADRESS = c('NY','NY','NY','NY','ROMA','ROMA','LONDON','TOKYO',''),
                      COLOR = c('GREEN','GREEN','RED','RED','YELLOW','YELLOW','BLUE','BLUE','ORANGE'),
                      CAR = c('','','BMW','BMW','TRUCK','TRUCK','FORD','FORD','FERRARI'),
                      COUNTRY = c('USA','USA','USA','USA','USA','USA','USA','USA','USA'))
I would like to organize the file so that there is one row per ID, as in the following example:
df2 <- tibble::tibble(ID = c(111,222,333),
                      NAME1 = c('JOHN','WILL','MARK'),
                      NAME2 = c('MARY','',''),
                      NAME3 = c('JAMES','',''),
                      ADRESS1 = c('NY','LONDON',''),
                      ADRESS2 = c('NY','TOKYO',''),
                      ADRESS3 = c('ROMA','',''),
                      COLOR1 = c('GREEN','BLUE','ORANGE'),
                      COLOR2 = c('RED','',''),
                      COLOR3 = c('YELLOW','',''),
                      CAR1 = c('','FORD','FERRARI'),
                      CAR2 = c('BMW','',''),
                      CAR3 = c('TRUCK','',''),
                      COUNTRY = c('USA','USA','USA'))
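Note, however, that the COUNTRY variable does not need several columns (COUNTRY1, COUNTRY2, COUNTRY3), because its value is the same in every row. My original file contains countless cases like this.
How can I arrange the data as in df2?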
Perhaps we can use reshape.
Try the following base R code:
u <- reshape(
transform(
unique(df1),
GRP = ave(seq_along(ID), ID, FUN = seq_along)
),
direction = "wide",
idvar = "ID",
timevar = "GRP"
)
# backslashes must be doubled inside an R string literal
u[order(match(gsub("\\.\\d+", "", names(u)), names(df1)))]
which gives
> u[order(match(gsub("\\.\\d+", "", names(u)), names(df1)))]
   ID NAME.1 NAME.2 NAME.3 ADRESS.1 ADRESS.2 ADRESS.3 COLOR.1 COLOR.2 COLOR.3
1 111   JOHN   MARY  JAMES       NY       NY     ROMA   GREEN     RED  YELLOW
7 222   WILL   WILL   <NA>   LONDON    TOKYO     <NA>    BLUE    BLUE    <NA>
9 333   MARK   <NA>   <NA>              <NA>     <NA>  ORANGE    <NA>    <NA>
    CAR.1 CAR.2 CAR.3 COUNTRY.1 COUNTRY.2 COUNTRY.3
1           BMW TRUCK       USA       USA       USA
7    FORD  FORD  <NA>       USA       USA      <NA>
9 FERRARI  <NA>  <NA>       USA      <NA>      <NA>
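The reshape result still spreads COUNTRY over three columns, while the question asks for a single COUNTRY column. A minimal base R sketch of one way to handle that follows. It assumes a column counts as "constant" when it holds exactly one distinct value inside every ID group; the names is_const, const_cols and wide are made up for this illustration. Constant columns are passed to reshape as extra id variables so they are not widened.

```r
# Same example data as in the question, built with base R only.
df1 <- data.frame(
  ID      = c(111, 111, 111, 111, 111, 111, 222, 222, 333),
  NAME    = c("JOHN", "JOHN", "MARY", "MARY", "JAMES", "JAMES", "WILL", "WILL", "MARK"),
  ADRESS  = c("NY", "NY", "NY", "NY", "ROMA", "ROMA", "LONDON", "TOKYO", ""),
  COLOR   = c("GREEN", "GREEN", "RED", "RED", "YELLOW", "YELLOW", "BLUE", "BLUE", "ORANGE"),
  CAR     = c("", "", "BMW", "BMW", "TRUCK", "TRUCK", "FORD", "FORD", "FERRARI"),
  COUNTRY = rep("USA", 9),
  stringsAsFactors = FALSE
)

d <- unique(df1)  # drop exact duplicate rows

# Assumption: a column is "constant" if it has one distinct value per ID.
is_const <- vapply(
  setdiff(names(d), "ID"),
  function(col) all(tapply(d[[col]], d$ID, function(x) length(unique(x)) == 1)),
  logical(1)
)
const_cols <- names(is_const)[is_const]

# Running index within each ID, used as the "time" variable.
d$GRP <- ave(seq_along(d$ID), d$ID, FUN = seq_along)

wide <- reshape(
  d,
  direction = "wide",
  idvar     = c("ID", const_cols),  # constant columns stay as single columns
  timevar   = "GRP",
  sep       = ""                    # NAME1 instead of NAME.1
)

wide[is.na(wide)] <- ""             # fill the gaps with empty strings

# Reorder columns to follow the original column order of df1.
# Note: this strips trailing digits, so it assumes no original
# column name ends in a digit.
stem <- sub("\\d+$", "", names(wide))
wide <- wide[order(match(stem, names(df1)))]
wide
```

Because the constant columns are detected from the data, the same code should carry over to a file with many such columns without listing them by hand.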
An option with pivot_wider:
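library(dplyr)
library(tidyr)
library(data.table)  # for rowid()
distinct(df1) %>%                 # drop exact duplicate rows
  mutate(rn = rowid(ID)) %>%      # running index within each ID
  pivot_wider(names_from = rn, values_from = NAME:CAR,
              names_sep = "", values_fill = "") %>%
  select(-COUNTRY, COUNTRY)       # move COUNTRY to the last column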
-output
# A tibble: 3 × 14
ID NAME1 NAME2 NAME3 ADRESS1 ADRESS2 ADRESS3 COLOR1 COLOR2 COLOR3 CAR1 CAR2 CAR3 COUNTRY
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 111 JOHN "MARY" "JAMES" "NY" "NY" "ROMA" GREEN "RED" "YELLOW" "" "BMW" "TRUCK" USA
2 222 WILL "WILL" "" "LONDON" "TOKYO" "" BLUE "BLUE" "" "FORD" "FORD" "" USA
3 333 MARK "" "" "" "" "" ORANGE "" "" "FERRARI" "" "" USA
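The pivot_wider answer names the columns to widen by hand (NAME:CAR), which may be impractical with hundreds of variables. The sketch below is one possible generalization, not the answerer's code: it assumes a column should stay single when it has exactly one distinct value within every ID, and it uses row_number() instead of data.table::rowid() to avoid the extra dependency. The names const_cols, value_cols and wide are hypothetical.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question.
df1 <- tibble(
  ID      = c(111, 111, 111, 111, 111, 111, 222, 222, 333),
  NAME    = c("JOHN", "JOHN", "MARY", "MARY", "JAMES", "JAMES", "WILL", "WILL", "MARK"),
  ADRESS  = c("NY", "NY", "NY", "NY", "ROMA", "ROMA", "LONDON", "TOKYO", ""),
  COLOR   = c("GREEN", "GREEN", "RED", "RED", "YELLOW", "YELLOW", "BLUE", "BLUE", "ORANGE"),
  CAR     = c("", "", "BMW", "BMW", "TRUCK", "TRUCK", "FORD", "FORD", "FERRARI"),
  COUNTRY = rep("USA", 9)
)

d <- distinct(df1)  # drop exact duplicate rows

# Assumption: columns with one distinct value per ID are kept as-is.
const_cols <- d %>%
  group_by(ID) %>%
  summarise(across(everything(), n_distinct), .groups = "drop") %>%
  select(-ID) %>%
  select(where(~ all(.x == 1))) %>%
  names()

# Everything else gets one column per within-ID position.
value_cols <- setdiff(names(d), c("ID", const_cols))

wide <- d %>%
  group_by(ID) %>%
  mutate(rn = row_number()) %>%   # running index within each ID
  ungroup() %>%
  pivot_wider(
    id_cols     = c(ID, all_of(const_cols)),
    names_from  = rn,
    values_from = all_of(value_cols),
    names_sep   = "",
    values_fill = ""
  ) %>%
  relocate(all_of(const_cols), .after = last_col())  # constants go last
wide
```

On the example data this keeps COUNTRY as one column and widens the other four, giving the same 3 × 14 layout as the hand-written version.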