How to convert a variable number of columns per row to multiple rows in R?
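Here is my problem: I have a database of individuals: 1 row = 1 person. For each person there is a unique identifier ("INAMI_key"), individual variables ("code_qualif" in the example below), and one or more addresses spread across separate columns. The number of addresses is given by the "n_adresses" variable: 1 means one address, 2 means two addresses, and so on. The addresses themselves are stored in the variables "travail_rueX" (street) and "travail_code_postalX" (postal code), where X is the address number. This is what it looks like: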
INAMI_key code_qualif n_adresses travail_rue1 travail_code_postal1 travail_rue2 travail_code_postal2 travail_rue3 travail_code_postal3 travail_rue4 travail_code_postal4 travail_rue5 travail_code_postal5
1 30000120 001 02 "RUE VAN ARTEVELDE " " 1000" "paul pastur " " 6180" "" "" "" "" "" ""
2 30000417 001 01 "av Margueritr depasse " " 6060" "" "" "" "" "" "" "" ""
3 37603435 007 01 "du Grand Veneur " " 1170" "" "" "" "" "" "" "" ""
4 38300152 007 02 "RUE WAUTERS 92-94" " 6040" "de châtelet " " 6120" "" "" "" "" "" ""
5 38707849 001 03 "de Campine " " 4000" "de Chestret " " 4000" "de la Gare " " 4020" "" "" "" ""
6 38813856 001 03 "Torhoutste steenweg " " 8400" "Lage Kaart " " 2930" "De Vrièrestraat " " 8301" "" "" "" ""
7 38811084 001 04 "chaussée de Waterloo " " 1180" "avenue Napoléon " " 1420" "rue Léon Théodor " " 1090" "avenue de la Basilique " " 1081" "" ""
8 39105054 001 04 "EMILE CLAUS " " 1050" "RUE DU FOYER SCHAERBEEKOIS " " 1030" "RUE XAVIER DE BUE " " 1180" "BV LAMBERMONT " " 1030" "" ""
9 39117031 001 05 "KERKSTRAAT " " 3850" "Pater Richard van de Wouwerstraa" " 3271" "Wilderenlaan " " 3803" "Gyzevennestraat " " 3560" "Molenveldstraat " " 3500"
10 31823918 070 05 "Route de l'Etat " " 1380" "Avenue Paul Hymans " " 1200" "Avenue WInston Churchill " " 1180" "Avenue Winston Churchill " " 1180" "avenue hippocrate " " 1200"
Here is the code to import this example into R:
df <- structure(list(INAMI_key = c("30000120", "30000417", "37603435",
"38300152", "38707849", "38813856", "38811084", "39105054", "39117031",
"31823918"), code_qualif = c("001", "001", "007", "007", "001",
"001", "001", "001", "001", "070"), n_adresses = c("02", "01",
"01", "02", "03", "03", "04", "04", "05", "05"), travail_rue1 = c("RUE VAN ARTEVELDE ",
"av Margueritr depasse ", "du Grand Veneur ",
"RUE WAUTERS 92-94", "de Campine ",
"Torhoutste steenweg ", "chaussée de Waterloo ",
"EMILE CLAUS ", "KERKSTRAAT ",
"Route de l'Etat "), travail_code_postal1 = c(" 1000",
" 6060", " 1170", " 6040", " 4000", " 8400", " 1180", " 1050",
" 3850", " 1380"), travail_rue2 = c("paul pastur ",
"", "", "de châtelet ", "de Chestret ",
"Lage Kaart ", "avenue Napoléon ",
"RUE DU FOYER SCHAERBEEKOIS ", "Pater Richard van de Wouwerstraa",
"Avenue Paul Hymans "), travail_code_postal2 = c(" 6180",
"", "", " 6120", " 4000", " 2930", " 1420", " 1030", " 3271",
" 1200"), travail_rue3 = c("", "", "", "", "de la Gare ",
"De Vrièrestraat ", "rue Léon Théodor ",
"RUE XAVIER DE BUE ", "Wilderenlaan ",
"Avenue WInston Churchill "), travail_code_postal3 = c("",
"", "", "", " 4020", " 8301", " 1090", " 1180", " 3803", " 1180"
), travail_rue4 = c("", "", "", "", "", "", "avenue de la Basilique ",
"BV LAMBERMONT ", "Gyzevennestraat ",
"Avenue Winston Churchill "), travail_code_postal4 = c("",
"", "", "", "", "", " 1081", " 1030", " 3560", " 1180"), travail_rue5 = c("",
"", "", "", "", "", "", "", "Molenveldstraat ",
"avenue hippocrate "), travail_code_postal5 = c("",
"", "", "", "", "", "", "", " 3500", " 1200")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
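What I would like to do is multiply the rows for each individual so that each address appears on its own row, in the same fields. For example, if a person has 3 addresses, create 3 rows for that person, keeping the individual variables but reorganizing the addresses into columns with the same names: "travail_rue_total" and "travail_code_postal_total" in the example below. If the individual has 1 address, create 1 row; if they have 5 addresses, create 5 rows; and so on: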
INAMI_key code_qualif n_adresses travail_rue_total travail_code_postal_total
1 30000120 1 2 RUE VAN ARTEVELDE 1000
2 30000120 1 2 paul pastur 6180
3 30000417 1 1 av Margueritr depasse 6060
4 37603435 7 1 du Grand Veneur 1170
5 38300152 7 2 RUE WAUTERS 92-94 6040
6 38300152 7 2 de châtelet 6120
7 38707849 1 3 de Campine 4000
8 38707849 1 3 de Chestret 4000
9 38707849 1 3 de la Gare 4020
10 38813856 1 3 Torhoutste steenweg 8400
11 38813856 1 3 Lage Kaart 2930
12 38813856 1 3 De Vrièrestraat 8301
13 38811084 1 4 chaussée de Waterloo 1180
14 38811084 1 4 avenue Napoléon 1420
15 38811084 1 4 rue Léon Théodor 1090
16 38811084 1 4 avenue de la Basilique 1081
17 39105054 1 4 EMILE CLAUS 1050
18 39105054 1 4 RUE DU FOYER SCHAERBEEKOIS 1030
19 39105054 1 4 RUE XAVIER DE BUE 1180
20 39105054 1 4 BV LAMBERMONT 1030
21 39117031 1 5 KERKSTRAAT 3850
22 39117031 1 5 Pater Richard van de Wouwerstraa 3271
23 39117031 1 5 Wilderenlaan 3803
24 39117031 1 5 Gyzevennestraat 3560
25 39117031 1 5 Molenveldstraat 3500
26 31823918 70 5 Route de l'Etat 1380
27 31823918 70 5 Avenue Paul Hymans 1200
28 31823918 70 5 Avenue WInston Churchill 1180
29 31823918 70 5 Avenue Winston Churchill 1180
30 31823918 70 5 avenue hippocrate 1200
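This is a simplified version of the data. In the full database, I have 40 addresses x 15 variables (street, number, city, postal code, institution...).
Thanks!
You can use pivot_longer to get the data in long format, then use filter to drop the rows that come from empty address slots.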
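library(dplyr)
library(tidyr)

df %>%
  # Strip the "travail_" prefix and the trailing address number; ".value" makes
  # the captured part ("rue", "code_postal") the name of the output column.
  pivot_longer(cols = starts_with('travail'),
               names_to = '.value',
               names_pattern = 'travail_(.*?)\\d+') %>%
  # Drop the rows created from unused (empty) address slots.
  filter(rue != '')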
# INAMI_key code_qualif n_adresses rue code_postal
# <chr> <chr> <chr> <chr> <chr>
# 1 30000120 001 02 "RUE VAN ARTEVELDE " " 1000"
# 2 30000120 001 02 "paul pastur " " 6180"
# 3 30000417 001 01 "av Margueritr depasse " " 6060"
# 4 37603435 007 01 "du Grand Veneur " " 1170"
# 5 38300152 007 02 "RUE WAUTERS 92-94" " 6040"
# 6 38300152 007 02 "de châtelet " " 6120"
# 7 38707849 001 03 "de Campine " " 4000"
# 8 38707849 001 03 "de Chestret " " 4000"
# 9 38707849 001 03 "de la Gare " " 4020"
#10 38813856 001 03 "Torhoutste steenweg " " 8400"
# … with 20 more rows
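The output above still carries the padded strings and the short column names, whereas the desired result in the question uses "travail_rue_total" and "travail_code_postal_total", trimmed values, and numeric-looking "code_qualif" and "n_adresses". A minimal sketch of that cleanup follows, under a few assumptions that are not in the original answer: the toy input `df_small` (two people, two address slots, values copied from the example), the extra `address_no` column, and the choice to keep postal codes as character are all illustrative. The pattern is anchored on the trailing digits, so it should also work for two-digit address numbers such as 10 or 40.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy input mirroring the question's layout.
df_small <- tibble::tibble(
  INAMI_key            = c("30000120", "30000417"),
  code_qualif          = c("001", "001"),
  n_adresses           = c("02", "01"),
  travail_rue1         = c("RUE VAN ARTEVELDE ", "av Margueritr depasse "),
  travail_code_postal1 = c(" 1000", " 6060"),
  travail_rue2         = c("paul pastur ", ""),
  travail_code_postal2 = c(" 6180", "")
)

long <- df_small %>%
  pivot_longer(
    cols = starts_with("travail"),
    # First capture group -> output column name, second -> address number.
    names_to = c(".value", "address_no"),
    names_pattern = "(travail_.*?)(\\d+)$"
  ) %>%
  # Drop rows that come from unused address slots.
  filter(travail_rue != "") %>%
  mutate(
    # Remove the padding around streets and postal codes.
    across(starts_with("travail"), trimws),
    # "001" -> 1, "02" -> 2, as in the desired output.
    across(c(code_qualif, n_adresses), as.integer)
  ) %>%
  rename(
    travail_rue_total         = travail_rue,
    travail_code_postal_total = travail_code_postal
  )

# Sanity check: each person should have as many rows as n_adresses declares.
check <- long %>%
  count(INAMI_key, n_adresses, name = "n_rows") %>%
  mutate(ok = n_rows == n_adresses)
```

If `check$ok` is FALSE for some person, the declared count and the filled address slots disagree, which is worth inspecting before relying on the reshaped data. With more variables per address (number, city, institution and so on), the same call should apply as long as every column follows the "travail_<name><number>" naming scheme.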